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ABSTRACT 



A modular architecture for storing, addressing and retrieving 
graphics data from main memory instead of expensive local 
frame buffer memory. A graphic address remapping table 
(GART), defined in software, is used to remap virtual 
addresses falling within a selected range, the GART range, 
to non-contiguous pages in main memory. Virtual address 
not within the selected range are passed without modifica- 
tion. The GART includes page table entries (PTEs) having 
translation information to remap virtual addresses falling 
within the GART range to their corresponding physical 
addresses. The GART PTEs are of configurable length 
enabling optimization of GART size and the use of feature 
bits, such as status indicators, defined by software. The 
GART is implemented during system boot up by configu- 
ration registers. Similarly, the PTEs are configured using 
mask registers. The GART may be used in conjunction with 
a translation lookaside buffer (TLB) to improve address 
remapping performance. 

18 Claims, 13 Drawing Sheets 
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APPARATUS COMPRISING A in 16-bit color at 30 frames/sec at 75 Hz screen refresh, 

TRANSLATION LOOKASIDE BUFFER FOR estimated bandwidths of 370 megabytes/sec to 840 

GRAPHICS ADDRESS REMAPPING OF megabytes/sec are needed for screen resolutions from 640x 

VIRTUAL ADDRESSES resolution (VGA) to 1024x768 resolution (XGA). Thus, 

5 rendering of 3D graphics on a display requires a Large 

CROSS REFERENCE TO RELATED amount of bandwidth between the graphics accelerator 110 

APPLICATIONS and the local frame buffer 112, where 3D texture maps and 

z-buffer data typically reside. 

This is a division of, and incorporates by reference in its Tq addition> many comp uter systems use virtual memory 
entirety, U.S. application Ser. No. 08/882,054, now U.S. Pat, systems t0 per mit the processor 102 to address more 
No. 6,249,853, tided "Apparatus for Graphic Address memory than is physically present in the main memory 106. 
Remapping", filed Jun. 25, 1997. This application is related A yirlual memory system ^lows addressing of very large 
to, and incorporates by reference in their entirety, U.S. Pat. amounts of memory as though all of that memory were a part 
No. 6,069,638, filed Jun. 25, 1997, titled "System for of ^ main memory of tne computer system. A virtual 
Accelerated Graphics Port Address Remapping Interface to mC mory system allows this even though actual main 
Main Memory", U.S. Pat. No. 6,282,625, filed Jun. 25, 1997, memory may cons is t 0 f some substantially lesser amount of 
titled "Method for Accelerated Graphics Port Address s(orage space than ig addresS able. For example, main 
Remapping Interface to Main Memory", U.S. application memory may include sixteen megabytes (16,777,216 bytes) 
Ser. No. 09/723,403, filed Nov. 27, 2000, titled "Method for of random access memory while a memory address- 
Implementing an Accelerated Graphics Port for a Multiple ■ tem permits ^ addressing of four gigabytes (4,294, 
Memory Controller Computer System", and U.S. Pat. No. 20 967 2 q 6 bytes \ 0 f mem0 ry 

6,252,612 filed Dec. 30, 1997, titled "Accelerated Graphics ^ mem s ide ^ bm usj , 

Port for Muluple Memory ControUer Computer Systems . memory manag / m / nt wit (MMU) t0 translale virtual 

BACKGROUND OF THE INVENTION memory addresses into their corresponding physical 

25 memory addresses, where the desired information actually 

1. Field of the Invention resides. A particular physical address holding desired infor- 
The present invention relates to computer systems, and mation may reside in main memory or in mass storage, such 

more particularly, to an apparatus for mapping virtual a s a tape drive or hard disk. If the physical address of the 

addresses to physical addresses in graphics applications. information is in main memory, the information is readily 

2. Description of the Related Technology 30 accessed and utilized. Otherwise, the information referenced 
As shown in FIG. 1, a conventional computer system by the physical address is in mass storage and the system 

architecture 100 includes a processor 102, system logic 104, transfers this information (usually in a block referred to as 

main memory 106, a system bus 108, a graphics accelerator a page) to main memory for subsequent use. This transfer 

110 communicating with a local frame buffer 112 and a may require the swapping of other information out of main 

plurality of peripherals 114. The processor 102 communi- 35 memory into mass storage in order to make room for the new 

cates with main memory 106 through a memory manage- information. If so, the MMU controls the swapping of 

ment unit (MMU) in the system logic 104. Peripherals 114 information to mass storage. 

and the graphics accelerator 110 communicate with main Pages are the usual mechanism used for addressing infor- 

memory 106 and system logic 104 through the system bus mation in a virtual memory system. Pages are numbered, 

108. The standard system bus 108 is currently the Periph- 40 and both physical and virtual addresses often include a page 

erals Connection Interface (PCI). The original personal number and an offset into the page. Moreover, the physical 

computer bus, the Industry Standard Architecture (ISA), is offset and the virtual offset are typically the same. In order 

capable of a peak data transfer rate of 8 megabytes/sec and to translate between the virtual and physical addresses, a 

is still used for low-bandwidth peripherals, such as audio. basic virtual memory system creates a series of lookup 

On the other hand, PCI supports multiple peripheral com- 45 tables, called page tables, stored in main memory. These 

ponents and add-in cards at a peak bandwidth of 132 page tables store the virtual address page numbers used by 

megabytes/sec. Thus, PCI is capable of supporting full the computer. Stored with each virtual address page number 

motion video playback at 30 frames/sec, true color high- is the corresponding physical address page number which 

resolution graphics and 100 megabits/sec Ethernet local area must be accessed to obtain the information. Often, the page 

networks. However, the emergence of high-bandwidth 50 tables are so large that they are paged themselves. The page 

applications, such as three dimensional (3D) graphics number of any virtual address presented to the memory 

applications, threatens to overload the PCI bus, management unit is compared to the values stored in these 

For example, a 3D graphics image is formed by taking a tables in order to find a matching virtual address page 

two dimensional image and applying, or mapping, it as a number for use in retrieving the corresponding physical 

surface onto a 3D object. The major kinds of maps include 55 address page number. 

texture maps, which deal with colors and textures, bump There are often several levels of tables, and the compari- 

maps, which deal with physical surfaces, reflection maps, son uses a substantial amount of system clock time. For 

refraction maps and chrome maps. Moreover, to add realism example, to retrieve a physical page address using lookup 

to a scene, 3D graphics accelerators often employ a z-buffer tables stored in main memory, the typical MMU first looks 

for hidden line removal and for depth queuing, wherein an 60 to a register for the address of a base table which stores 

intensity value is used to modify the brightness of a pixel as pointers to other levels of tables. The MMU retrieves this 

a function of distance. A z-buffer memory can be as large or pointer from the base table and places it in another register, 

larger than the memory needed to store two dimensional The MMU then uses this pointer to go to the next level of 

images. The graphics accelerator 110 retrieves and manipu- table. This process continues until the physical page address 

lates image data from the local frame buffer 112, which is a 65 of the information sought is recovered. When the physical 

type of expensive high performance memory. For example, address is recovered, it is combined with the offset furnished 

to transfer an average 3D scene (polygon overlap of three) as a part of the virtual address and the processor uses the 
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result to access the particular information desired. Comple- 
tion of a typical lookup in the page tables may take from ten 
to fifteen clock cycles at each level of the search. 

To overcome this delay, virtual management systems 
often include cache memories called translation look aside 
buffers (TLBs). A TLB is essentially a buffer for caching 
recently translated virtual page addresses along with their 
corresponding physical page addresses. Such an address 
cache works on the same principle as do caches holding data 
and instructions, the most recently used addresses are more 
likely to be used than are other addresses. Thus, if a 
subsequent virtual address refers to the same page as the last 
one, the page table lookup process is skipped to save time. 
A TLB entry is like a cache entry wherein a tag portion 
includes portions of the virtual address and the data portion 
includes a physical page frame number, protections fields, 
use bits and status bits. When provided with a virtual page 
address stored in the TLB (a translation hit), the TLB 
furnishes a physical page address for the information with- 
out having to consult any page lookup tables. When the 
processor requests a virtual page address not stored in the 
TLB (a translation miss), the MMU must then consult the 
page lookup tables. When this occurs, the physical page 
address recovered is stored along with the virtual page 
address in the TLB so that it is immediately available for 
subsequent use. This saves a substantial amount of time on 
the next use of the information. For example, accessing the 
information using a TLB may require only one or two clock 
cycles compared to the hundreds of clock cycles required for 
a page table lookup. 

Virtual memory systems are common in the art. For 
example, in U.S. Pat. No. 5,446,854, Khalidi et al, disclose 
a method and apparatus for virtual to physical address 
translation using hashing. Similarly, Crawford et al. disclose 
a microprocessor architecture having segmentation mecha- 
nisms for translating virtual addresses to physical addresses 
in U.S. Pat. No. 5,321,836. Lastly, in U.S. Pat. Nos. 5,491, 
806 and 5,546,555, Horstmann, et al. disclose an optimized 
translation lookaside buffer for use in a virtual memory 
system. 

As shown in FIG. 1, moving 3D graphics data to the main 
memory 106 in current computer systems would require the 
graphics accelerator 110 to access the 3D graphics data 
through the PCI system bus 108. Thus, although Bech- 
tolsheim discloses a data bus enabling virtual memory data 
transfers in U.S. Pat. Nos. 4,937,734 and 5,121,487, 3D 
rendering exceeds the peak PCI bandwidth of 132 
megabytes/sec because a bandwidth of at least 370 
megabytes/sec is needed for data transfer from main 
memory 106. Moreover, the graphics accelerator 110 often 
requires storage of graphics data into large contiguous 
blocks of memory. For example, a 16-bit 256x256 pixel 
texture map for 3D graphics applications requires a memory 
block having a size of 128K bytes. However, operating 
system software, such as Microsoft®, Windows®, Win- 
dows® 95 and Windows NT®, and the system logic 104 
often allocate main memory in page frames having smaller 
sizes, such as 4K. In U.S. Pat. No. 5,465,337, Kong dis- 
closes a memory management unit capable of handling 
virtual address translations for multiple page sizes. 
However, this does not address the bandwidth limitations of 
the PCI bus discussed above. In order to move 3D graphics 
data from the local frame buffer 112 to main memory 106, 
computer systems require an improved method for storing 
and addressing graphics data in main memory. 

In U.S. Pat. No, 5,313,577, Meinerth et al. discloses a 
graphics processor capable of reading from, and writing to, 
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virtual memory. This graphics processor can be described by 
reference to FIG. 2, which illustrates a graphics/memory 
control unit 120 including a graphics processor unit 122 that 
communicates with a memory control unit 124. The 

S graphics/memory control unit 120 in turn communicates 
with the main memory 106 and the frame buffer 112 through 
a dedicated memory bus 126. The graphics processor unit 
122 includes an address generator and a virtual translation 
unit to provide for translation of virtual addresses to physical 

10 addresses when accessing the main memory 106 and the 
frame buffer 112. In addition, the memory control unit 124 
communicates with a processor 102 through a dedicated 
system bus 128, with an I/O device 114 through a dedicated 
I/O bus 130 and with computer networks through a dedi- 

15 cated network bus 132. In contrast to the structure of FIG. 
1, the use of dedicated buses for communication with the 
main memory 106, I/O devices 114 and computer networks 
substantially increases system cost and decreases the flex- 
ibility with which a computer system can be upgraded. For 

20 example, to upgrade the graphics capability of a computer 
system having the structure as illustrated in FIG. 1, one 
simply connects a more powerful graphics adapter to the 
PCI bus 108 (FIG. 1). However, upgrading the graphics 
capability of a computer system having the structure of FIG. 

25 2 requires replacement of the memory control unit 124 as 
well as the graphics processor unit 122. Similarly, the 
structure of FIG. 2 is not compatible with the vast majority 
of available PCI enhancement devices. Moreover, the struc- 
ture of FIG. 2 also requires the graphics processor unit 122 

30 to access 3D graphics data through a memory bus 126. 
In view of the limitations discussed above, computer 
manufacturers require a modular architecture that reduces 
the cost of system upgrades, such as enhanced 3D graphics 
adapters, to improve display performance. Similarly, to 

35 reduce system memory costs, computer manufacturers 
require improved methods for storing, addressing and 
retrieving graphics data from main memory instead of 
expensive local frame buffer memory. Moreover, to address 
the needs of high bandwidth graphics applications without 

40 substantial increases in system cost, computer manufactur- 
ers require improved technology to overcome current system 
bus bandwidth limitations. 

SUMMARY OF THE INVENTION 

45 

One embodiment of the invention includes a graphics 
address remapping table (GART), the GART stored in 
memory, comprising at least one page table entry (PTE) 
providing information for translation of a virtual address to 

50 a physical address, wherein the virtual address includes a 
first portion and a second portion, the first portion being used 
to locate a PTE in the GART corresponding to the virtual 
address and wherein the second portion and the information 
provided by the PTE are combined to provide the physical 

5S address. 

Another embodiment of the invention includes a page 
table entry for a graphics address remapping table stored in 
memory comprising a physical page translation field having 
translation information and a feature bits field having at least 

go one indicator defining an attribute of the physical page 
translation field. 

Yet another embodiment of the invention includes a 
translation lookaside buffer (TLB) in a memory, the TLB 
receiving a portion of a virtual address selected from a 

65 graphics address remapping range, comprising at least one 
TLB entry, wherein each of the at least one TLB entries 
includes a virtual page field and a corresponding physical 
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page field, wherein if the portion of the virtual address 
matches the virtual page field of one TLB entry, the TLB 
provides translation information from the physical page field 
of the one TLB entry to form a physical address. 

Yet another embodiment of the invention includes an 5 
apparatus for graphic address remapping of a virtual address 
comprising a graphics address remapping table (GART) 
stored in memory and having information which is used to 
translate the virtual address to a physical address and a 
translation lookaside buffer (TLB) receiving a portion of the 10 
virtual address, the TLB having at least one TLB entry, 
wherein each of the at least one TLB entries includes a 
virtual page field and a corresponding physical page field, 
wherein if the portion of the virtual address matches the 
virtual page field of one TLB entry, the TLB provides 15 
translation information from the physical page field of the 
one TLB entry to form the physical address and wherein if 
the portion of the virtual address does not match the virtual 
page field of one TLB entry, the GART provides translation 
information referenced by the portion of the virtual address 20 
to form the physical address. 

Lastly, yet another embodiment of the present invention 
includes an apparatus for graphic address remapping of a 
virtual address comprising an interface and a translation 
lookaside buffer (TLB) in communication with the interface, 25 
the TLB having at least one TLB entry including informa- 
tion which is used to translate the virtual address to a 
physical address, wherein the interface receives a portion of 
the virtual address and provides access to a TLB entry 
corresponding to the portion of the virtual address. 30 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram illustrating the architecture of 
a prior art computer system. 35 

FIG. 2 is a block diagram illustrating the architecture of 
another prior art computer system. 

FIG. 3 is a block diagram illustrating the architecture of 
a computer system of one embodiment of the present 
invention. 40 

FIG. 4 is a diagram illustrating the address space of a 
processor of one embodiment of the present invention. 

FIG. Sa is a diagram illustrating the translation of a virtual 
address to a physical address of one embodiment of the 45 
present invention. 

FIG. 5b is a diagram illustrating a page table entry (PTE) 
of the graphic address remapping table (GART) of one 
embodiment of the present invention. 

FIG. 6a is a diagram illustrating the generation of a 50 
translation look aside buffer (TLB) entry of one embodiment 
of the present invention. 

FIG. 6b is a block diagram illustrating one embodiment of 
an interface for the direct access of a translation look aside 
buffer (TLB) of one embodiment of the present invention. 55 

FIG. 7 is a diagram illustrating the translation of a virtual 
address to a physical address using the TLB of one embodi- 
ment of the present invention. 

FIG. 8 is a flowchart illustrating the method of processing 60 
an AGP request of the present invention. 

FIG. 9 is a flowchart illustrating the method of updating 
a least recently used (LRU) counter of one embodiment of 
the present invention. 

FIG. 10 is a flowchart illustrating the method of selecting 65 
a slot to store a TLB entry of one embodiment of the present 
invention. 
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FIG. 11 is a flowchart illustrating the method of fetching 
a page table entry (PTE) of one embodiment of the present 
invention. 

DETAILED DESCRIPTION OF THE 
INVENTION 

In contrast to the conventional computer system architec- 
ture 100 (FIG. 1), embodiments of the present invention 
enables relocation of a portion of the 3D graphics data, such 
as the texture data, from the local frame buffer 112 (FIG. 1) 
to main memory 106 (FIG. 1) to reduce the size, and thus the 
cost, of the local frame buffer 112 and to improve system 
performance. For example, as texture data is generally read 
only, moving it to main memory 106 does not cause coher- 
ency or data consistency problems. Similarly, as texture data 
is loaded from mass storage into main memory 106, leaving 
it in main memory 106 instead of copying it to the local 
frame buffer 112 reduces overhead. Moreover, as the com- 
plexity and quality of 3D images has increased, leaving 3D 
graphics data in the local frame buffer 112 has served to 
increase the computer system cost over time. Thus, although 
moving 3D graphics data to main memory 106 may likewise 
require an increase in the size of the main memory 106, the 
architecture of the present invention reduces the total system 
cost because it is less expensive to increase main memory 
106 than to increase local frame buffer memory 112. 

Referring now to FIG. 3, the computer system architec- 
ture 150 of one embodiment of the present invention 
includes a processor 152, system logic 154, main memory 
156, a system bus 158, a graphics accelerator 160 commu- 
nicating with a local frame buffer 162 and a plurality of 
peripherals 164. The processor 152 communicates with the 
main memory 156 through a memory management unit 
(MMU) in the system logic 154. Peripherals 114 commu- 
nicate with the main memory 156 and system logic 154 
through the system bus 158. Note however that the graphics 
accelerator 160 communicates with the system logic 154 and 
main memory 156 through an accelerated graphics port 
(AGP) 166. AGP 166 is not a bus, but a point-to-point 
connection between an AGP compliant target, the MMU 
portion of the system logic 154, and an AGP-compliant 
master, the graphics accelerator 160. The AGP 166 point- 
to-point connection enables data transfer on both the rising 
and falling clock edges, improves data integrity, simplifies 
AGP protocols and eliminates bus arbitration overhead. 
AGP provides a protocol enhancement enabling pipelining 
for read and write accesses to main memory 156. 

For example, the graphics accelerator 160 initiates a 
pipelined transaction with an access request. System logic 
154 responds to the request by initiating the corresponding 
data transfer at a later time. The graphics accelerator 160 can 
then issue its next pipelined transaction while waiting for the 
previous data to return. This overlap results in several read 
or write requests being active at any point in time. In one 
embodiment, the AGP 166 operates with a 66 MHz clock 
using 3.3 volt signaling. As data transfers can occur on both 
clock edges, the AGP 166 enables an effective 133 MHz data 
transfer rate and can reach a peak bandwidth of 533 
megabytes/sec. For detailed information on the AGP 166, 
consult the Accelerated Graphics Port Interface 
Specification, Revision 1.0, released by Intel Corporation 
and available from Intel in Adobe" Acrobat" format on the 
World Wide Web . This document is hereby incorporated by 
reference. 

As noted above, the embodiment of FIG. 3 enables the 
graphics accelerator 160 to access both main memory 156 
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and the local frame buffer 162. From the perspective of the 
graphics accelerator 160, the main memory 156 and the local 
frame buffer 162 are logically equivalent. Thus, to optimize 
system performance, graphics data may be stored in either 
the main memory 156 or the local frame buffer 162. In 5 
contrast to the direct memory access (DMA) model where 
graphics data is copied from the main memory 156 into the 
local frame buffer 162 by a long sequential block transfer 
prior to use, the graphics accelerator 160 of the present 
invention can also use, or "execute," graphics data directly 10 
from the memory in which it resides (the "execute" model). 
However, since the main memory 156 is dynamically allo- 
cated in random pages of a selected size, such as 4K, the 
"execute" model requires an address mapping mechanism to 
map random pages into a single contiguous, physical ]5 
address space needed by the graphics accelerator 160. 

FIG. 4 illustrates an embodiment of the address space 180 
of the computer system 150 (FIG. 3) of the present inven- 
tion. For example, a 32 bit processor 152 (FIG. 3) has an 
address space 180 including 2 32 (or 4,294,967,296) different 20 
addresses. A computer system 150 (FIG. 3) typically uses 
different ranges of the address space 180 for different 
devices and system agents. In one embodiment, the address 
space 180 includes a local frame buffer range 182, a graphics 
address remapping table (GART) range 184 and a main 2 s 
memory range 186. In contrast to prior art systems, 
addresses falling within the GART range 184 are remapped 
to n on -contiguous pages within the main memory range 186. 
All addresses not in the GART range 184 are passed through 
without modification so that they map directly to the main 3 q 
memory range 186 or to device specific ranges, such as the 
local frame buffer range 182. In one embodiment, the system 
logic 154 performs the address remapping using a memory 
based table, the GART, defined in software with an appli- 
cation program interface (API). Moreover, the GART table 35 
format is abstracted to the API by a hardware abstraction 
layer (HAL) or a rniniport driver provided by the system 
logic 154. Thus, by denning the GART in software, the 
present invention advantageously provides the substantial 
implementation flexibility needed to address future parti- 40 
tioning and remapping circuitry (hardware) as well as any 
current or future compatibility issues. 

FIG. 5a illustrates the translation of a virtual address 200 
to a physical address 202 in one embodiment of the present 
invention. As discussed previously, in one embodiment, only 45 
those virtual addresses falling within the GART range 184 
(FIG. 4) are remapped to main memory 186 (FIG, 4). A 
virtual address 200 includes a virtual page number field 204 
and an offset field 206, Translation of the contents of the 
virtual page number field 204 occurs by finding a page table 50 
entry (PTE) corresponding to the virtual page number field 
204 among the plurality of GART PTEs 208 in the GART 
table 210. To identify the appropriate PTE having the 
physical address translation, the GART base address 212 is 
combined at 213 with the contents of the virtual page ss 
number field 204 to obtain a PTE address 214. The contents 
referenced by the PTE address 214 provide the physical 
page number 216 corresponding to the virtual page number 
204. The physical page number 216 is then combined at 217 
with the contents of the offset field 206 to form the physical 60 
address 202. The physical address 202 in turn references a 
location in main memory 218 having the desired informa- 
tion. 

The GART table 210 may include a plurality of PTEs 208 
having a size corresponding to the memory page size used 65 
by the processor 152 (FIG. 3). For example, an Intel" 
Pentium* or Pentium" Pro processor operates on memory 
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pages having a size of 4K. Thus, a GART table 210 adapted 
for use with these processors may include PTEs referencing 
4K pages. In one embodiment, the virtual page number field 
204 comprises the upper 20 bits and the offset field 206 
comprises the lower 12 bits of a 32 bit virtual address 200. 
Thus, each page includes 2 12 =4096 (4K) addresses and the 
lower 12 bits of the offset field 206 locate the desired 
information within a page referenced by the upper 20 bits of 
the virtual page number field 204. The GART table 210 
preferably resides in the main memory 218. Memory refers 
generally to storage devices, such as registers, SRAM, 
DRAM, flash memory, magnetic storage devices, optical 
storage devices and other forms of volatile and non- volatile 
storage. 

FIG. 5b illustrates one possible format for a GART PTE 
220. The GART PTE 220 includes a feature bits field 222 
and a physical page translation (PPT) field 224. In contrast 
to prior art systems where hardwired circuitry defines a page 
table format, the GART table 210 (FIG. 5a) may include 
PTEs of configurable length enabling optimization of table 
size and the use of feature bits defined by software. The 
length of the GART PTE 220 is 2 PTESize bytes or 8*2™^ 
bits. For example, for a PTESize=5, the GART PTE has a 
length of 32 bytes or 256 bits. The PPT field 224 includes 
PPTSize bits to generate a physical address 202 (FIG. 5a). 
PPTSize defines the number of translatable addresses, and 
hence the GART table 210 (FIG. 5a) includes 2 PPTSize PTE 
entries. As PTESize defines the size of each GART PTE 220, 
the memory space needed for the entire GART table 210 
(FIG. Sa) is 2^ PTBSix ^ PPTS ^ bytes. For example, the GART 
table 210 in a system with a 4K(=2 12 ) memory page size and 
32 megabytes (=2 2S ) of main memory 218 (FIG. 5a) 
includes 2 25 /2 12 =2 13 =8192 PTEs. Thus, only 13 bits are 
needed to define 8192 unique PTEs to span the entire 32 
megabytes of main memory 218 (FIG. 5a) and PPTSize«13. 
However, to accommodate various software feature bits, 
each PTE may have a size of 8 bytes (=2 3 and PTESize =3). 
Thus, the size of the GART table 210 is 2^ PTES ^ PPTS ^~ 
2 (3+13) =2 16 =65536 bytes«64K. 

As noted above, the GART table 210 (FIG. 5a) may use 
4K page boundaries. Thus, when (PTESize+PPTSize) is less 
than 12 bits (2 12 =4096 bytes-4K), the entire GART table 
210 (FIG. 5a) resides within one 4K page. For values greater 
than 12, the GART table 210 (FIG. 5a) resides on multiple 
4K pages. To maintain compatibility with the Intel 13 Pen- 
tium Pro processor caches, the GART base address 214 
(FIG. 5a) may begin on a 2 {plT:size+PPTSiz ^ byte boundary. 
Thus, a GART base address 214 (FIG. 5a) can not have a 
value which aligns the GART table 210 (FIG. 5a) on an 
address boundary less than the size of the GART table 210 
(FIG. 5a). For example, an 8K GART table 210 (FIG. 5a) 
must begin on a 8K boundary. 

In one embodiment, an initialization BIOS implements 
the GART table 210 (FIG. 5a) by loading configuration 
registers in the system logic 154 (FIG. 3) during system boot 
up. In another embodiment, the operating system imple- 
ments the GART table 210 (FIG. 5a) using an API to load 
the configuration registers in the system logic 154 (FIG. 3) 
during system boot up. The operating system then deter- 
mines the physical location of the GART table 210 (FIG. 5a) 
within main memory 218 (FIG. 5a) by selecting the proper 
page boundary as described above (i.e., an 8K GART table 
begins on an 8K boundary). For example, the system loads 
configuration registers holding the GART base address 214 
(FIG. 5a) defining the beginning of the GART table 210 
(FIG. 5a), PTESize defining the size of a GART PTE 220 
and PPTSize defining the size of the physical address used 
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to translate a virtual address. In addition, the system loads a 
configuration register forAGPAperture, defining the lowest 
address of the GART range 184 (FIG. 4), and PhysBase, 
defining the remaining bits needed to translate a virtual 
address not included in the PPTSize bits. 

For example, consider a system having 64 megabytes of 
main memory 218 (FIG. 5a) encompassing physical 
addresses 0 through 0x03FFFFFF with the AGP related data 
occupying the upper 32 megabytes of main memory 218 
referenced by physical addresses 0x02000000 through 
0x03FFFFFF. If the GART Range 184 (FIG. 4) begins at the 
256 megabyte virtual address boundary 0x10000000, the 
invention enables translation of virtual addresses within the 
GART Range 184 to physical addresses in the upper 32 
megabytes of main memory 218 corresponding to physical 
addresses in the range 0x02000000 through 0x03FFFFFF. 
As noted earlier, a GART table 210 includes multiple PTEs, 
each having physical page translation information 224 and 
software feature bits 222. The GART table 210 may be 
located at any physical address in the main memory 218, 
such as the 2 megabyte physical address 0x00200000. For a 
system having a 4K memory page size and a GART PTE 220 
of 8 byte length, the GART table 210 is configured as 
follows: 



PhysBase 
PhysSize 
AGPAperture 
GARTBase 

2FTESize 

PageSize 



■0x02000000 — Start of remapped physical address 

32 megabytes — Size of remapped physical addresses 
=0x10000000 —Start address of GART Range 
:~OxOD200000 —Start address of GART table 

£ bytes —Size of each GART Page Table Entry 

4 kilobytes — Memory page size 



PhysBase 

AGPAperture 

GARTBase 

PTESize 

PPTSize 



=0x02000000 — Start of remapped physical address 

;«0xl 0000000 — Start address of GART Range 

:«0x00200000 —Start address of GART table 

3 —Size of each GART PTE 

43 —Number of PPT bits in each PTE 



10 



20 



25 



To determine the number of PTEs in the GART table 210, 
the size of the physical address space in main memory 218 
allocated to AGP related data, the upper 32 megabytes* 
33554432 bytes, is divided by the memory page size, 
4K-4096 bytes, to obtain 8192 PTEs. Note that 8192~2 13 = 
2 PTEsiz* and thus ^ pTESize«13. To implement the GART 
table 210, the configuration registers are programmed with 
the following values: 



40 



45 



Lastly, the GART table 210 is initialized for subsequent use. 

Using pseudo-VHDLcode, system logic 154 (FIG. 3) can 
quickly determine whether a 32 bit AGP address (AGPAddr) 
requires translation from a virtual to physical address 
(PhysAddr) as follows: 

if ((AGPAddr(31 downto 12) and not {2 PPTSUe 
AGPAperture (31 downto 12)) then 
Virtual- true; 
else 

Virtu al=false; 
end if; 

When the address is virtual, then the PTE address 214 
(PTEAddr) is calculated as follows: 

PTEAddr<=((AGPAddr(31 downto 12) and (2V mu ">- 
1)) shl 2 p ^ Size ) or (GARTBase and not (2 iPTESUe+ 

Note that the "shl" function indicates a left shift with zero 
fill, which can be implemented in hardware using a multi- 



50 



60 



65 



plexer. Lastly, to determine the physical address 202 
(PhysAddr) when PPTSize does not include sufficient bits to 
remap the entire GART range 184 (FIG. 4), the physical 
page 216 is generated as follows: 

PhysAddr(31 downto 12)<-(PhysBase(31 downto 12) 
and not (2 PPTS **-1)) or (PTE and (2 PPTSUe -\))); 
To obtain the physical address 202, the physical page 216, 
PhysAddr(31 downto 12), is then combined with the offset 
206. Note that the pseudo-code above avoids the use of 
adders, which impede system performance at high clock 
frequencies, in the virtual to physical address translation 
process. 

To illustrate the use of the pseudo-code above, suppose an 
AGP master, such as the graphics accelerator 160 (FIG, 3), 
presents the virtual address 0x0002030, which corresponds 
to AGPAddr in the pseudo-code, to the system logic 154 
(FIG. 3) for translation. To determine if AGPAddr= 
0x10002030 is appropriate for translation using the GART 
table configured above, the system logic 154 first evaluates 
the if condition: 

((AGPAddr(31 downto 12) and not {2 PPTSize -\))- 
AGPAperture (31 downto 12)) 
to determine if it is true or false. In addition, the expression 
(2 PPTSize -\) indicates that the lower PPTSize bits are set, 
which is easily performed in hardware. For the GART table 
210 configured above, note that PPTSize=13, (2 PPTSi *'-l)~ 
OxOlIFFF (hexadecimal) and AGPAperture=Oxl0000O00. 
The notation (31 downto 12) indicates use of bit positions 12 
through 31 of an address, which is equivalent to truncating 
the lower 12 bits of a binary address or the lower three 
values of a hexadecimal address. Thus, for AGPAddr= 
0x10002030 and AGPAperture -0x1 0000000, AGPAddr(31 
downto 12)=0xl0002 and AGPAperture(31 downto 12)=* 
0x10000. Now, substitute the values for AGPAddr, AGPA- 
perture and (2 PPTSi **-l) into the if condition: 

((AGPAddr(31 downto 12) and not (2 PPTSize -l))= 
AGPAperture (31 downto 12)) -or- 

(0x10002 and not (Ox01FFF))=Oxl000O -or- 

0x10000=0x10000 
to calculate a result. Here, the result is true indicating that 
AGPAddr=Qxl0002030 is a valid address for translation. 
Similarly, for the virtual address 0x11002030, the if condi- 
tion produces this result: 0x11000=0x10000. As 
0x110000x10000, this result is false indicating that the 
virtual address 0x11002030 does not fall within the GART 
range 184. If an AGP master presented the virtual address 
0x11002030, the system logic 154 reports an error. 

To determine the location of the PTE in the GART table 
210 having the translation information for the virtual address 
AGPAddr=Qxl0002030, the expression: 

PTEAddr<=((AGPAddr(31 downto 12) and (2 (PPTSU ^- 
1)) shl 2 PTESU «) or (GARTBase and not (2 {PTESU ~ 

PPTSue)-!))) 

is evaluated. For the GART table 210 configured above, 
GARTBase=Ox00200000, PPTSize=13, PTESize=3 and 
( 2 <^"~^~<)-i).0x0FFFF. ^ nQted abovej (? PPTSUc_ 
l)=0x01FFF and AGPAddr(31 downto 12)=0xl0002. Now, 
substitute the values into the equation for PTEAddress: 
PTEAddr<=((0xl0002 and OxOlFFF) shl 3) or 

(0x00200000 and not (OxOFFFF)) -or- 
PTEAddr<=(0x00002 shl 3) or (0x00200000) -or- 
PTEAddr<=(Ox00000010) or (0x00200000)= 
0x00200010. 

As each PTE occupies 8 bytes and the GART table 210 
begins at the GARTBase address=0x00200000, the calcu- 
lated PTEAddress-Ox00200010 corresponds to the third 
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entry or PTE(2), 16 bytes away from the GARTBase 
address. Suppose that the lower 32 bits (or 4 bytes) of the 
value at PTE(2) =0x12345678. As shown in the embodiment 
of FIG. 5b, the lower PPTSize=13 bits correspond to the 
PPT translation bits and the higher order bits are software 5 
feature bits 222. Of course, in another embodiment, the PPT 
translation information may comprise the higher order bits 
while the software feature bits 222 may comprise the lower 
order bits. Moreover, the PPT translation information and 
the software feature bits 222 may be located at any of the bit 3Q 
positions within a PTE 220. 

Lastly, to calculate the physical address corresponding to 
the virtual address AGPAddr-Oxl 0002030, the expression: 
PhysAddr(31 downto 12)<=(PhysBase(31 downto 12) 
and not (2 PPTSUe -\)) or (PTE and (2"™"-l))) J5 
is evaluated. For the GART table 210 configured above, 
PhysBase=0x02000000 and (2 p/>r5 ^-l)=0x01FFF. Note 
also that PTE(2)=0xl2345678. Now, substitute the values 
into the equation for PhysAddr(31 downto 12): 

PhysAddr(31 downto 12)<-(0x02000 and not (OxOlFFF)) 20 

or (0x12345678 and OxOlFFF)) -or- 
PhysAddr(31 downto 12)<=(0x02000) or (0x00001678)= 
0x03678. Note that the offset 206 corresponds to the 
lower 12 bits of the virtual address 0x10002030 or 
AGPAddr(ll downto 0)-030. Thus, to obtain the 25 
physical address 206, the physical page 216 is com- 
bined with the offset 206 to form PhysAddr(31 downto 
0) or 0x03678030. To summarize, the pseudo-code of 
the embodiment described above illustrates the trans- 
lation of the virtual address 0x10002030 to the physical 30 
address 0x03678030. 
Moreover, the feature bits field 222 provides status infor- 
mation for use in virtual to physical address translations. In 
contrast to prior art systems, the feature bits of one embodi- 
ment of the present invention provide substantial design 35 
flexibility by enabling software to change the format of the 
GART table 210 (FIG. So) without the need for a costly 
redesign of the hardwired circuitry. For example, during an 
address translation, the system may need to verify that the 
physical address corresponding to the virtual address still 4 q 
includes valid data. Similarly, the system may need to 
determine if a referenced physical address has been read or 
written to. The contents of the feature bits field 222 provide 
this functionality. In one embodiment, the feature bits field 
222 includes indicators for PTE valid 226, page read 228 45 
and page write 230. These indicators 226, 228, 230 may be 
located anywhere within the feature bits field 222 and may 
be implemented using at least one bit. To implement an 
indicator, such as PTE valid 226, the present invention uses 
a mask register loaded during system boot up. Thus, for PTE 50 
valid 226, the ValidMask register is used to select the bit(s) 
to set in the feature bits field 222 to indicate a valid PTE. 
Similarly, for page read 228, the ReadMask register is used 
to select the bit(s) to set when a translated address has been 
read. Furthermore, for a page write 230, the WriteMask 55 
register is used to select the bit(s) to set when a translated 
address has been written to. For example, if ValidMask is 
zero, then no PTE Valid 226 indicator is defined. Otherwise, 
a PTE Valid 226 mask is defined and can be applied to a 
GART PTE 220 to determine if the physical address trans- 60 
lation is valid. The following VHDL pseudo-code imple- 
ments this logic: 

if ((ValidMask=0) or ((ValidMask and PTE)=ValidMask)) 
then 

PTEValid :=true; 65 
else 

PTEValid :-false; 
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end if; 

Similarly, to implement the page read 228 and page write 
230 indicators, a logical OR operation is performed on the 
GART PTE 220 using the WriteMask during write opera- 
tions and with the ReadMask during read operations. The 
resulting GART PTE 220 is then written to memory 218 
(FIG. 5a) to provide the page read 228 or page write 230 
status information. In a similar fashion, if the WriteMask or 
ReadMask is zero, then no page write 230 or page read 228 
indicator is defined and the GART PTE 220 is not written to 
memory. The following VHDL pseudo-code implements the 
page write 230 and page read 228 indicators: 

if ((WriteMask 0) and ((PTE and WriteMask) 

WriteMask)) 
then 

PTE :oPTE or WriteMask; 
UpdatePTE :=true; 
else 

PTE :-PTE; 
UpdatePTE :=false; 
end if; 

if ((ReadMask 0) and ((PTE and ReadMask) ReadMask)) 
then 

PTE :=PTE or ReadMask; 
UpdatePTE :-true; 
else 

PTE :=PTE; 
UpdatePTE :=false; 
end if; 

As discussed previously, the indicators 226, 228, 230 may be 
implemented by programming a mask register during system 
boot up. In one embodiment, the initialization BIOS pro- 
grams the mask register. In another embodiment, an oper- 
ating system API programs the mask register during system 
boot up. 

For example, suppose the following mask registers: 



ValidMask :=0x001 00000 —Position of Valid indicator in PTE 
WriteMask :=OxOO2O0DO0 — Position of Write indicator in PTE 
ReadMask : =0x0 0400 000 — Position of Read indicator in PTE 



are programmed during system boot up. To determine if the 
contents of a PTE 220 are valid, the if condition: 

((ValidMask-0) or ((ValidMask and PTE)-ValidMask)) is 
evaluated to determine if it is true or false. Referring 
back to the previous example, note that PTE(2)= 
0x12345678. Now, substitute the values of PTE(2) and 
ValidMask into the if condition: 
((ValidMask-0) or ((ValidMask and PTE)-ValidMask)) 
-or- 

((0x00100000=0) or ((0x00100000 and 0x12345678)= 
0x00100000)) -or- 

((0x00100000=0) or (0x00100000=0x00100000)) 
to calculate a result. Here, the result is true indicating that 
the PTE is valid. Similarly, for a ValidMask set to 
0x01000000, evaluation of the if condition proceeds as 
follows: 

((ValidMask=0) or ((ValidMask and PTE) = ValidMask)) 
-or- 

((0x01000000=0) or ((0x01000000 and 0x12345678)= 
0x01000000)) -or- 

((0x0100000-0) or (0x00000000-0x01000000)) 
to produce a false result as both (0x0100000 0) and 
(0x00000000 0x01000000), indicating an error reporting 
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ABSTRACT 



A modular architecture for storing, addressing and retrieving 
graphics data from main memory instead of expensive local 
frame buffer memory. A graphic address remapping table 
(GART), defined in software, is used to remap virtual 
addresses falling within a selected range, the GART range, 
to non-contiguous pages in main memory. Virtual address 
not within the selected range are passed without modifica- 
tion. The GART includes page table entries (PTEs) having 
translation information to remap virtual addresses falling 
within the GART range to their corresponding physical 
addresses. The GART PTEs are of configurable length 
enabling optimization of GART size and the use of feature 
bits, such as status indicators, defined by software. The 
GART is implemented during system boot up by configu- 
ration registers. Similarly, the PTEs are configured using 
mask registers. The GART may be used in conjunction with 
a translation lookaside buffer (TLB) to improve address 
remapping performance. 

18 Claims, 13 Drawing Sheets 
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APPARATUS COMPRISING A in 16-bit color at 30 frames/sec at 75 Hz screen refresh, 

TRANSLATION LOOKASIDE BUFFER FOR estimated bandwidths of 370 megabytes/sec to 840 

GRAPHICS ADDRESS REMAPPING OF megabytes/sec are needed for screen resolutions from 640x 

VIRTUAL ADDRESSES 480 resolution (VGA) to 1024x768 resolution (XGA). Thus, 

5 rendering of 3D graphics on a display requires a large 

CROSS REFERENCE TO RELATED amount of bandwidth between the graphics accelerator 110 

APPLICATIONS and the local frame buffer 112, where 3D texture maps and 

. . z-buffer data typically reside. 
This is a division of, and incorporates by reference in its In addition? many computer systems use virtual memory 
entirety, U.S. application Ser. No. 08/882,054, now U.S. Pat. ^ lQ permit the processor 102 to address more 
No. 6,249,853, titled "Apparatus for Graphic Address memory than is physically present in the main memory 106. 
Remapping", filed Jun. 25, 1997. This application is related A memory syslem allows addressing of very large 
to, and incorporates by reference in their entirety, U.S. Pat. amounts of memory as though all of that memory were a part 
No. 6,069,638, filed Jun. 25, 1997, titled "System for of ^ main memory of me computer system. A virtual 
Accelerated Graphics Port Address Remapping Interface to mem ory system allows this even though actual main 
Main Memory", U.S. Pat. No. 6,282,625, filed Jun. 25, 1997, memory may consist of some substantially lesser amount of 
titled "Method for Accelerated Graphics Port Address stQrage space than is addressa bl e . For example, main 
Remapping Interface to Mam Memory", U.S. application memory may include sixteen megabytes (16,777,216 bytes) 
Ser. No. 09/723,403, filed Nov. 27, 2000, titled "Method for of ran(Jom acces& memory while a v^a! mem ory address- 
Implementing an Accelerated Graphics Port for a Multiple • tem permits ^ addressing of four gigabytes (4,294, 
Memory Controller Computer System", and U.S. Pat. No. iU 957295 bytes) of memory 

6,252 612 filed Dec. 30, 1997, fitted "Accelerated Graphics ^ g idc ^ bm usi a 

Port for Multiple Memory Controller Computer Systems . memQry manag / m / nt mi * (MMU) t0 translate virtual 

BACKGROUND OF THE INVENTION memory addresses into their corresponding physical 

25 memory addresses, where the desired information actually 

1. Field of the Invention resides. A particular physical address holding desired infor- 
The present invention relates to computer systems, and mation may reside in main memory or in mass storage, such 

more particularly, to an apparatus for mapping virtual as a tape drive or hard disk. If the physical address of the 

addresses to physical addresses in graphics applications. information is in main memory, the information is readily 

2. Description of the Related Technology 30 accessed and utilized. Otherwise, the information referenced 
As shown in FIG. 1, a conventional computer system by the physical address is in mass storage and the system 

architecture 100 includes a processor 102, system logic 104, transfers this information (usually in a block referred to as 

main memory 106, a system bus 108, a graphics accelerator a page) to main memory for subsequent use. This transfer 

110 communicating with a local frame buffer 112 and a may require the swapping of other information out of main 

plurality of peripherals 114. The processor 102 communi- 35 memory into mass storage in order to make room for the new 

cates with main memory 106 through a memory manage- information. If so, the MMU controls the swapping of 

ment unit (MMU) in the system logic 104. Peripherals 114 information to mass storage. 

and the graphics accelerator 110 communicate with main Pages are the usual mechanism used for addressing infor- 

memory 106 and system logic 104 through the system bus mation in a virtual memory system. Pages are numbered, 

108. The standard system bus 108 is currently the Periph- 40 and both physical and virtual addresses often include a page 

erals Connection Interface (PCI). The original personal number and an offset into the page. Moreover, the physical 

computer bus, the Industry Standard Architecture (ISA), is offset and the virtual offset are typically the same. In order 

capable of a peak data transfer rate of 8 megabytes/sec and to translate between the virtual and physical addresses, a 

is still used for low-bandwidth peripherals, such as audio. basic virtual memory system creates a series of lookup 

On the other hand, PCI supports multiple peripheral com- 45 tables, called page tables, stored in main memory. These 

ponents and add-in cards at a peak bandwidth of 132 page tables store the virtual address page numbers used by 

megabytes/sec. Thus, PCI is capable of supporting full the computer. Stored with each virtual address page number 

motion video playback at 30 frames/sec, true color high- is the corresponding physical address page number which 

resolution graphics and 100 megabits/sec Ethernet local area must be accessed to obtain the information. Often, the page 

networks. However, the emergence of high -bandwidth 50 tables are so large that they are paged themselves. The page 

applications, such as three dimensional (3D) graphics number of any virtual address presented to the memory 

applications, threatens to overload the PCI bus. management unit is compared to the values stored in these 

For example, a 3D graphics image is formed by taking a tables in order to find a matching virtual address page 

two dimensional image and applying, or mapping, it as a number for use in retrieving the corresponding physical 

surface onto a 3D object. The major kinds of maps include 55 address page number. 

texture maps, which deal with colors and textures, bump There are often several levels of tables, and the compari- 

maps, which deal with physical surfaces, reflection maps, son uses a substantial amount of system clock time. For 

refraction maps and chrome maps. Moreover, to add realism example, to retrieve a physical page address using lookup 

to a scene, 3D graphics accelerators often employ a z-buffer tables stored in main memory, the typical MMU first looks 

for hidden line removal and for depth queuing, wherein an 60 to a register for the address of a base table which stores 

intensity value is used to modify the brightness of a pixel as pointers to other levels of tables. The MMU retrieves this 

a function of distance. A z-buffer memory can be as large or pointer from the base table and places it in another register, 

larger than the memory needed to store two dimensional The MMU then uses this pointer to go to the next level of 

images. The graphics accelerator 110 retrieves and manipu- table. This process continues until the physical page address 

lates image data from the local frame buffer 112, which is a 65 of the information sought is recovered. When the physical 

type of expensive high performance memory. For example, address is recovered, it is combined with the offset furnished 

to transfer an average 3D scene (polygon overlap of three) as a part of the virtual address and the processor uses the 
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result to access the particular information desired. Comple- 
tion of a typical lookup in the page tables may take from ten 
to fifteen clock cycles at each level of the search. 

To overcome this delay, virtual management systems 
often include cache memories called translation look aside 5 
buffers (TLBs). A TLB is essentially a buffer for caching 
recently translated virtual page addresses along with their 
corresponding physical page addresses. Such an address 
cache works on the same principle as do caches holding data 
and instructions, the most recently used addresses are more 10 
likely to be used than are other addresses. Thus, if a 
subsequent virtual address refers to the same page as the last 
one, the page table lookup process is skipped to save time. 
A TLB entry is like a cache entry wherein a tag portion 
includes portions of the virtual address and the data portion 15 
includes a physical page frame number, protections fields, 
use bits and status bits. When provided with a virtual page 
address stored in the TLB (a translation hit), the TLB 
furnishes a physical page address for the information with- 
out having to consult any page lookup tables. When the 20 
processor requests a virtual page address not stored in the 
TLB (a translation miss), the MMU must then consult the 
page lookup tables. When this occurs, the physical page 
address recovered is stored along with the virtual page 
address in the TLB so that it is immediately available for 25 
subsequent use. This saves a substantial amount of time on 
the next use of the information. For example, accessing the 
information using a TLB may require only one or two clock 
cycles compared to the hundreds of clock cycles required for 
a page table lookup, 30 

Virtual memory systems are common in the art. For 
example, in U.S. Pat. No. 5,446,854, Khalidi et al. disclose 
a method and apparatus for virtual to physical address 
translation using hashing. Similarly, Crawford et al. disclose 
a microprocessor architecture having segmentation mecha- 35 
nisms for translating virtual addresses to physical addresses 
in U.S. Pat, No. 5,321,836. Lastly, in U.S. Pat. Nos, 5,491, 
806 and 5,546,555, Horstmann, et al. disclose an optimized 
translation lookaside buffer for use in a virtual memory 
system. 40 

As shown in FIG. 1, moving 3D graphics data to the main 
memory 106 in current computer systems would require the 
graphics accelerator 110 to access the 3D graphics data 
through the PCI system bus 108. Thus, although Bech- 
tolsheim discloses a data bus enabling virtual memory data 45 
transfers in U.S. Pat. Nos. 4,937,734 and 5,121,487, 3D 
rendering exceeds the peak PCI bandwidth of 132 
megabytes/sec because a bandwidth of at least 370 
megabytes/sec is needed for data transfer from main 
memory 106. Moreover, the graphics accelerator 110 often 50 
requires storage of graphics data into large contiguous 
blocks of memory. For example, a 16-bit 256x256 pixel 
texture map for 3D graphics applications requires a memory 
block having a size of 128K bytes. However, operating 
system software, such as Microsoft®, Windows®, Win- 55 
dows® 95 and Windows NT®, and the system logic 104 
often allocate main memory in page frames having smaller 
sizes, such as 4K. In U.S. Pat. No. 5,465,337, Kong dis- 
closes a memory management unit capable of handling 
virtual address translations for multiple page sizes. 60 
However, this does not address the bandwidth limitations of 
the PCI bus discussed above. In order to move 3D graphics 
data from the local frame buffer 112 to main memory 106, 
computer systems require an improved method for storing 
and addressing graphics data in main memory. gs 

In U.S. Pat. No. 5,313,577, Meinertn et al. discloses a 
graphics processor capable of reading from, and writing to, 
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virtual memory. This graphics processor can be described by 
reference to FIG. 2, which illustrates a graphics/memory 
control unit 120 including a graphics processor unit 122 that 
communicates with a memory control unit 124. The 
graphics/memory control unit 120 in turn communicates 
with the main memory 106 and the frame buffer 112 through 
a dedicated memory bus 126. The graphics processor unit 
122 includes an address generator and a virtual translation 
unit to provide for translation of virtual addresses to physical 
addresses when accessing the main memory 106 and the 
frame buffer 112. In addition, the memory control unit 124 
communicates with a processor 102 through a dedicated 
system bus 128, with an I/O device 114 through a dedicated 
I/O bus 130 and with computer networks through a dedi- 
cated network bus 132. In contrast to the structure of FIG. 
1, the use of dedicated buses for communication with the 
main memory 106, I/O devices 114 and computer networks 
substantially increases system cost and decreases the flex- 
ibility with which a computer system can be upgraded. For 
example, to upgrade the graphics capability of a computer 
system having the structure as illustrated in FIG, 1, one 
simply connects a more powerful graphics adapter to the 
PCI bus 108 (FIG. 1). However, upgrading the graphics 
capability of a computer system having the structure of FIG. 
2 requires replacement of the memory control unit 124 as 
well as the graphics processor unit 122. Similarly, the 
structure of FIG. 2 is not compatible with the vast majority 
of available PCI enhancement devices. Moreover, the struc- 
ture of FIG. 2 also requires the graphics processor unit 122 
to access 3D graphics data through a memory bus 126. 

In view of the limitations discussed above, computer 
manufacturers require a modular architecture that reduces 
the cost of system upgrades, such as enhanced 3D graphics 
adapters, to improve display performance. Similarly, to 
reduce system memory costs, computer manufacturers 
require improved methods for storing, addressing and 
retrieving graphics data from main memory instead of 
expensive local frame buffer memory. Moreover, to address 
the needs of high bandwidth graphics applications without 
substantial increases in system cost, computer manufactur- 
ers require improved technology to overcome current system 
bus bandwidth limitations. 

SUMMARY OF THE INVENTION 

One embodiment of the invention includes a graphics 
address remapping table (GART), the GART stored in 
memory, comprising at least one page table entry (PTE) 
providing information for translation of a virtual address to 
a physical address, wherein the virtual address includes a 
first portion and a second portion, the first portion being used 
to locate a PTE in the GART corresponding to the virtual 
address and wherein the second portion and the information 
provided by the PTE are combined to provide the physical 
address. 

Another embodiment of the invention includes a page 
table entry for a graphics address remapping table stored in 
memory comprising a physical page translation field having 
translation information and a feature bits field having at least 
one indicator defining an attribute of the physical page 
translation field. 

Yet another embodiment of the invention includes a 
translation lookaside buffer (TLB) in a memory, the TLB 
receiving a portion of a virtual address selected from a 
graphics address remapping range, comprising at least one 
TLB entry, wherein each of the at least one TLB entries 
includes a virtual page field and a corresponding physical 
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page field, wherein if the portion of the virtual address 
matches the virtual page field of one TLB entry, the TLB 
provides translation information from the physical page field 
of the one TLB entry to form a physical address. 

Yet another embodiment of the invention includes an 5 
apparatus for graphic address remapping of a virtual address 
comprising a graphics address remapping table (GART) 
stored in memory and having information which is used to 
translate the virtual address to a physical address and a 
translation lookaside buffer (TLB) receiving a portion of the 10 
virtual address, the TLB having at least one TLB entry, 
wherein each of the at least one TLB entries includes a 
virtual page field and a corresponding physical page field, 
wherein if the portion of the virtual address matches the 
virtual page field of one TLB entry, the TLB provides 1 5 
translation information from the physical page field of the 
one TLB entry to form the physical address and wherein if 
the portion of the virtual address does not match the virtual 
page field of one TLB entry, the GART provides translation 
information referenced by the portion of the virtual address 20 
to form the physical address. 

Lastly, yet another embodiment of the present invention 
includes an apparatus for graphic address remapping of a 
virtual address comprising an interface and a translation 
lookaside buffer (TLB) in communication with the interface, 25 
the TLB having at least one TLB entry including informa- 
tion which is used to translate the virtual address to a 
physical address, wherein the interface receives a portion of 
the virtual address and provides access to a TLB entry 
corresponding to the portion of the virtual address. 30 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram illustrating the architecture of 
a prior art computer system. 35 

FIG. 2 is a block diagram illustrating the architecture of 
another prior art computer system. 

FIG. 3 is a block diagram illustrating the architecture of 
a computer system of one embodiment of the present 
invention. 40 

FIG. 4 is a diagram illustrating the address space of a 
processor of one embodiment of the present invention. 

FIG. 5a is a diagram illustrating the translation of a virtual 
address to a physical address of one embodiment of the 45 
present invention. 

FIG. Sb is a diagram illustrating a page table entry (PTE) 
of the graphic address remapping table (GART) of one 
embodiment of the present invention. 

FIG. 6a is a diagram illustrating the generation of a 50 
translation look aside buffer (TLB) entry of one embodiment 
of the present invention. 

FIG. 6b is a block diagram illustrating one embodiment of 
an interface for the direct access of a translation look aside 
buffer (TLB) of one embodiment of the present invention. 55 

FIG. 7 is a diagram illustrating the translation of a virtual 
address to a physical address using the TLB of one embodi- 
ment of the present invention. 

FIG. 8 is a flowchart illustrating the method of processing 60 
an AGP request of the present invention. 

FIG. 9 is a flowchart illustrating the method of updating 
a least recently used (LRU) counter of one embodiment of 
the present invention. 

FIG. 10 is a flowchart illustrating the method of selecting 65 
a slot to store a TLB entry of one embodiment of the present 
invention. 
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FIG. 11 is a flowchart illustrating the method of fetching 
a page table entry (PTE) of one embodiment of the present 
invention. 

DETAILED DESCRIPTION OF THE 
INVENTION 

In contrast to the conventional computer system architec- 
ture 100 (FIG. 1), embodiments of the present invention 
enables relocation of a portion of the 3D graphics data, such 
as the texture data, from the local frame buffer 112 (FIG. 1) 
to main memory 106 (FIG. 1) to reduce the size, and thus the 
cost, of the local frame buffer 112 and to improve system 
performance. For example, as texture data is generally read 
only, moving it to main memory 106 does not cause coher- 
ency or data consistency problems. Similarly, as texture data 
is loaded from mass storage into main memory 106, leaving 
it in main memory 106 instead of copying it to the local 
frame buffer 112 reduces overhead. Moreover, as the com- 
plexity and quality of 3D images has increased, leaving 3D 
graphics data in the local frame buffer 112 has served to 
increase the computer system cost over time. Thus, although 
moving 3D graphics data to main memory 106 may likewise 
require an increase in the size of the main memory 106, the 
architecture of the present invention reduces the total system 
cost because it is less expensive to increase main memory 
106 than to increase local frame buffer memory 112. 

Referring now to FIG. 3, the computer system architec- 
ture 150 of one embodiment of the present invention 
includes a processor 152, system logic 154, main memory 
156, a system bus 158, a graphics accelerator 160 commu- 
nicating with a local frame buffer 162 and a plurality of 
peripherals 164. The processor 152 communicates with the 
main memory 156 through a memory management unit 
(MMU) in the system logic 154. Peripherals 114 commu- 
nicate with the main memory 156 and system logic 154 
through the system bus 158. Note however that the graphics 
accelerator 160 communicates with the system logic 154 and 
main memory 156 through an accelerated graphics port 
(AGP) 166. AGP 166 is not a bus, but a point-to-point 
connection between an AGP compliant target, the MMU 
portion of the system logic 154, and an AGP-compliant 
master, the graphics accelerator 160, The AGP 166 point- 
to-point connection enables data transfer on both the rising 
and falling clock edges, improves data integrity, simplifies 
AGP protocols and eliminates bus arbitration overhead. 
AGP provides a protocol enhancement enabling pipelining 
for read and write accesses to main memory 156. 

For example, the graphics accelerator 160 initiates a 
pipelined transaction with an access request. System logic 
154 responds to the request by initiating the corresponding 
data transfer at a later time. The graphics accelerator 160 can 
then issue its next pipelined transaction while waiting for the 
previous data to return. This overlap results in several read 
or write requests being active at any point in time. In one 
embodiment, the AGP 166 operates with a 66 MHz clock 
using 3.3 volt signaling. As data transfers can occur on both 
clock edges, the AGP 166 enables an effective 133 MHz data 
transfer rate and can reach a peak bandwidth of 533 
megabytes/sec. For detailed information on the AGP 166, 
consult the Accelerated Graphics Port Interface 
Specification, Revision 1.0, released by Intel Corporation 
and available from Intel in Adobe 11 Acrobat" format on the 
World Wide Web . This document is hereby incorporated by 
reference. 

As noted above, the embodiment of FIG. 3 enables the 
graphics accelerator 160 to access both main memory 156 
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and the local frame buffer 162. From the perspective of the pages having a size of 4K. Thus, a GART table 210 adapted 

graphics accelerator 160, the main memory 156 and the local for use with these processors may include PTEs referencing 

frame buffer 162 are logically equivalent. Thus, to optimize 4K pages. In one embodiment, the virtual page number field 

system performance, graphics data may be stored in either 204 comprises the upper 20 bits and the offset field 206 

the main memory 156 or the local frame buffer 162. In s comprises the lower 12 bits of a 32 bit virtual address 200. 

contrast to the direct memory access (DMA) model where Thus, each page includes 2 12 =4096 (4K) addresses and the 

graphics data is copied from the main memory 156 into the lower 12 bits of the offset field 206 locate the desired 

local frame buffer 162 by a long sequential block transfer information within a page referenced by the upper 20 bits of 

prior to use, the graphics accelerator 160 of the present the virtual page number field 204, The GART table 210 

invention can also use, or "execute," graphics data directly 10 preferably resides in the main memory 218. Memory refers 

from the memory in which it resides (the "execute" model). generally to storage devices, such as registers, SRAM, 

However, since the main memory 156 is dynamically alio- DRAM, flash memory, magnetic storage devices, optical 

cated in random pages of a selected size, such as 4K, the storage devices and other forms of volatile and non-volatile 

"execute" model requires an address mapping mechanism to storage. 

map random pages into a single contiguous, physical 15 FIG. 5b illustrates one possible format for a GART PTE 

address space needed by the graphics accelerator 160. 220. The GART PTE 220 includes a feature bits field 222 

FIG. 4 illustrates an embodiment of the address space 180 and a physical page translation (PPT) field 224. In contrast 

of the computer system 150 (FIG. 3) of the present inven- to prior art systems where hardwired circuitry defines a page 

tion. For example, a 32 bit processor 152 (FIG. 3) has an table format, the GART table 210 (FIG. 5a) may include 

address space 180 including 2 32 (or 4,294,967,296) different 2 o PTEs of configurable length enabling optimization of table 

addresses. A computer system 150 (FIG. 3) typically uses size and the use of feature bits defined by software. The 

different ranges of the address space 180 for different length of the GART PTE 220 is 2 PTESize bytes or s*2 PTESu:e 

devices and system agents. In one embodiment, the address bits. For example, for a PTESize=5, the GART PTE has a 

space 180 includes a local frame buffer range 182, a graphics length of 32 bytes or 256 bits. The PPT field 224 includes 

address remapping table (GART) range 184 and a main 25 PPTSize bits to generate a physical address 202 (FIG. 5a). 

memory range 186. In contrast to prior art systems, PPTSize defines the number of translatable addresses, and 

addresses falling within the GART range 184 are remapped hence the GART table 210 (FIG. Sa) includes 2 PPTSlze PTE 

to non-contiguous pages within the main memory range 186 . entries. As PTESize defines the size of each GART PTE 220, 

All addresses not in the GART range 184 are passed through the memory space needed for the entire GART table 210 

without modification so that they map directly to the main 30 (FIG. 5a) is 2^ PTESU ^ PPTSix ^ bytes. For example, the GART 

memory range 186 or to device specific ranges, such as the table 210 in a system with a 4K(=2 12 ) memory page size and 

local frame buffer range 182. In one embodiment, the system 32 megabytes (=2 2S ) of main memory 218 (FIG. 5a) 

logic 154 performs the address remapping using a memory includes 2 2 72 J2 =2 13 =8192 PTEs. Thus, only 13 bits are 

based table, the GART, defined in software with an appli- needed to define 8192 unique PTEs to span the entire 32 

cation program interface (API). Moreover, the GART table 35 megabytes of main memory 218 (FIG. 5a) and PPTSize=13. 

format is abstracted to the API by a hardware abstraction However, to accommodate various software feature bits, 

layer (HAL) or a miniport driver provided by the system each PTE may have a size of 8 bytes (=2 3 an^FTESiK=3). 

logic 154. Thus, by defining the GART in software, the Thus, the size of the GART table 210 is 2 < > PTES " x + pprSLCe >- 

present invention advantageously provides the substantial 2 (3+13) =2 16 =65536 bytes=64K, 

implementation flexibility needed to address future parti- 40 As noted above, the GART table 210 (FIG. 5a) may use 

tioning and remapping circuitry (hardware) as well as any 4Kpage boundaries. Thus, when (PTESize+PPTSize) is less 

current or future compatibility issues. than 12 bits (2 12 =4096 bytes=4K), the entire GART table 

FIG. 5a illustrates the translation of a virtual address 200 210 (FIG. 5a) resides within one 4K page. For values greater 

to a physical address 202 in one embodiment of the present than 12, the GART table 210 (FIG. 5a) resides on multiple 

invention. As discussed previously, in one embodiment, only 45 4K pages. To maintain compatibility with the Intel Pen- 

those virtual addresses falling within the GART range 184 tium Pro processor caches, the GART base address 214 

(FIG. 4) are remapped to main memory 186 (FIG. 4). A (FIG. 5a) may begin on a 2^ msu ** FFTSa *'> byte boundary, 

virtual address 200 includes a virtual page number field 204 Thus, a GART base address 214 (FIG. 5a) can not have a 

and an offset field 206. Translation of the contents of the value which aligns the GART table 210 (FIG. 5a) on an 

virtual page number field 204 occurs by finding a page table 50 address boundary less than the size of the GART table 210 

entry (PTE) corresponding to the virtual page number field (FIG. 5a). For example, an 8K GART table 210 (FIG. 5a) 

204 among the plurality of GART PTEs 208 in the GART must begin on a 8K boundary. 

table 210. To identify the appropriate PTE having the In one embodiment, an initialization BIOS implements 

physical address translation, the GART base address 212 is the GART table 210 (FIG. Sa) by loading configuration 

combined at 213 with the contents of the virtual page 55 registers in the system logic 154 (FIG. 3) during system boot 

number field 204 to obtain a PTE address 214. The contents up. In another embodiment, the operating system imple- 

referenced by the PTE address 214 provide the physical ments the GART table 210 (FIG. 5a) using an API to load 

page number 216 corresponding to the virtual page number the configuration registers in the system logic 154 (FIG. 3) 

204. The physical page number 216 is then combined at 217 during system boot up. The operating system then deter- 

with the contents of the offset field 206 to form the physical $o mines the physical location of the GART table 210 (FIG. 5a) 

address 202. The physical address 202 in turn references a within main memory 218 (FIG. 5a) by selecting the proper 

location in main memory 218 having the desired informa- page boundary as described above (i.e., an 8K GART table 

tion. begins on an 8K boundary). For example, the system loads 

The GART table 210 may include a plurality of PTEs 208 configuration registers holding the GART base address 214 

having a size corresponding to the memory page size used 65 (FIG. 5a) defining the beginning of the GART table 210 
by the processor 152 (FIG. 3). For example, an Intef (FIG. 5a), PTESize defining the size of a GART PTE 220 

Pentium 11 or Pentium 11 Pro processor operates on memory and PPTSize defining the size of the physical address used 
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to translate a virtual address. In addition, the system loads a 
configuration register forAGPAperture, defining the lowest 
address of the GART range 184 (FIG. 4), and PhysBase, 
defining the remaining bits needed to translate a virtual 
address not included in the PPTSize bits. 

For example, consider a system having 64 megabytes of 
main memory 218 (FIG. 5a) encompassing physical 
addresses 0 through Qx03FFFFFF with the AGP related data 
occupying the upper 32 megabytes of main memory 218 
referenced by physical addresses 0x02000000 through 
0x03FFFFFF. If the GART Range 184 (FIG. 4) begins at the 
256 megabyte virtual address boundary 0x10000000, the 
invention enables translation of virtual addresses within the 
GART Range 184 to physical addresses in the upper 32 
megabytes of main memory 218 corresponding to physical 
addresses in the range 0x02000000 through 0x03FFFFFF. 
As noted earlier, a GART table 210 includes multiple PTEs, 
each having physical page translation information 224 and 
software feature bits 222. The GART table 210 may be 
located at any physical address in the main memory 218, 
such as the 2 megabyte physical address 0x00200000. For a 
system having a 4K memory page size and a GART PTE 220 
of 8 byte length, the GART table 210 is configured as 
follows: 



FhysBase 
PhysSize 
AGPApeiture 
GARTBase 

PageSize 



=0x02000000 — Start of remapped physical address 

=32 megabytes — Size of remapped physical addresses 

=0x10000000 — Start address of GART Range 

=0x00200000 — Start address of GART table 

=8 bytes —Size of each GART Page Table Entry 

4 kilobytes — Memory page size 



FhysBase 

AGPAperture 

GARTBase 

PTESize 

PPTSize 



-0x02000000 — Start of remapped physical address 

:«0x 10000000 — Start address of GART Range 

=0x00200000 — Start address of GART table 

:=3 --Size of each GART PTE 

:-13 — Number of PPT bits in each PTE 



10 



25 



30 



To determine the number of PTEs in the GART table 210, 
the size of the physical address space in main memory 218 
allocated to AGP related data, the upper 32 megabytes- 
33554432 bytes, is divided by the memory page size, 
4K=4096 bytes, to obtain 8192 PTEs. Note that 8192=2 13 = 
2 ptes^ and pTESize=13. To implement the GART 
table 210, the configuration registers are programmed with 
the following values: 



35 



40 



Lastly, the GART table 210 is initialized for subsequent use. 

Using pseudo-VHDL code, system logic 154 (FIG. 3) can 
quickly determine whether a 32 bit AGP address (AGPAddr) 
requires translation from a virtual to physical address 
(PhysAddr) as follows: 

if ((AGPAddr(31 downto 12) and not (2 pprsUe " l »= 
AGPAperture (31 downto 12)) then 
Virtual-true; 
else 

Virtual=false; 
end if; 

When the address is virtual, then the PTE address 214 
(PTEAddr) is calculated as follows: 

PTEAddr<=((AGPAddr(31 downto 12) and (2 (PPTSiz ^- 
1)) shl 2 pYESize ) or (GARTBase and not (2 {PTlLs ^ 

PPTSute)-l))); 

Note that the "shl" function indicates a left shift with zero 
fill, which can be implemented in hardware using a multi- 



50 



60 



65 



plexer. Lastly, to determine the physical address 202 
(PhysAddr) when PPTSize does not include sufficient bits to 
remap the entire GART range 184 (FIG. 4), the physical 
page 216 is generated as follows: 

PhysAddr(31 downto 12)<-(PhysBase(31 downto 12) 
and not (2 PF7 ^-1)) or (PTE and (2 PPTS ™-X))); 
To obtain the physical address 202, the physical page 216, 
PhysAddr(31 downto 12), is then combined with the offset 
206. Note that the pseudo-code above avoids the use of 
adders, which impede system performance at high clock 
frequencies, in the virtual to physical address translation 
process. 

To illustrate the use of the pseudo-code above, suppose an 
AGP master, such as the graphics accelerator 160 (FIG. 3), 
presents the virtual address 0x0002030, which corresponds 
to AGPAddr in the pseudo-code, to the system logic 154 
(FIG. 3) for translation. To determine if AGPAddr= 
0x10002030 is appropriate for translation using the GART 
table configured above, the system logic 154 first evaluates 
the if condition: 

((AGPAddr(31 downto 12) and not {2 PPTSize -Vj)- 
AGPAperture (31 downto 12)) 
to determine if it is true or false. In addition, the expression 
(2 ppr ^'-l) indicates that the lower PPTSize bits are set, 
which is easily performed in hardware. For the GART table 
210 configured above, note that PPTSize=13, (2 PPTSU *-l)= 
OxOlIFFF (hexadecimal) and AGPAperture=OxlO00O0O0. 
The notation (31 downto 12) indicates use of bit positions 12 
through 31 of an address, which is equivalent to truncating 
the lower 12 bits of a binary address or the lower three 
values of a hexadecimal address. Thus, for AGPAddr* 
0x10002030 and AGPAperture-OxlOOOOOOO, AGPAddr(31 
downto 12)«0xl0002 and AGPAperture(31 downto 12)= 
0x10000. Now, substitute the values for AGPAddr, AGPA- 
perture and (2 PPTSL "-1) into the if condition: 

((AGPAddr(31 downto 12) and not (2 PPTSi **-l))« 
AGPAperture (31 downto 12)) -or- 

(0x10002 and not (Ox01FFF))=Oxl0000 -or- 

0x10000=0x10000 
to calculate a result. Here, the result is true indicating that 
AGPAddr=Qx 10002030 is a valid address for translation. 
Similarly, for the virtual address 0x11002030, the if condi- 
tion produces this result: 0x11000=0x10000. As 
0x110000x10000, this result is false indicating that the 
virtual address 0x11002030 does not fall within the GART 
range 184. If an AGP master presented the virtual address 
0x11002030, the system logic 154 reports an error. 

To determine the location of the PTE in the GART table 
210 having the translation information for the virtual address 
AGPAddr=0xl 0002030, the expression: 

PTEAddr<=((AGPAddr(31 downto 12) and (2<"™»>- 
1)) shl 2 PTBSUe ) or (GARTBase and not (2 iPTESi *" 

FFTSizeyiyj) 

is evaluated. For the GART table 210 configured above, 
GARTBase=Ox00200000, PPTSize-13, PTESize=3 and 
( 2 < OT ^~ p ™ te >-l)^x0FFFF. As noted above, (2™*"- 
l)=0x01FFF and AGPAddr(31 downto 12)=0xl0002. Now, 
substitute the values into the equation for PTEAddress: 
PTEAddr<=((0xl0002 and OxOlFFF) shl 3) or 

(0x00200000 and not (OxOFFFF)) -or- 
PTEAddr<=(0x00002 shl 3) or (0x00200000) -or- 
PTEAddr< = (Ox00000010) or (0x00200000)= 
0x00200010. 

As each PTE occupies 8 bytes and the GART table 210 
begins at the GARTBase address=0x00200000, the calcu- 
lated PTEAddress-0x00200010 corresponds to the third 
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entry or PTE(2), 16 bytes away from the GARTBase 
address. Suppose that the lower 32 bits (or 4 bytes) of the 
value at PTE(2)=0xl2345678. As shown in the embodiment 
of FIG. Sb, the lower PPTSize=13 bits correspond to the 
PPT translation bits and the higher order bits are software 
feature bits 222. Of course, in another embodiment, the PPT 
translation information may comprise the higher order bits 
while the software feature bits 222 may comprise the lower 
order bits. Moreover, the PPT translation information and 
the software feature bits 222 may be located at any of the bit 
positions within a PTE 220. 

Lastly, to calculate the physical address corresponding to 
the virtual address AGPAddr-Oxl 0002030, the expression: 

PhysAddr(31 downto 12)<=(PhysBase(31 downto 12) 
and not {2 PPTSUe -\)) or (PTE and (2 PPTS * e -l))) 
is evaluated. For the GART table 210 configured above, 
PhysBase=0x02000000 and (2 F/>7;s£a? -l)=0x01FFF. Note 
also that PTE(2)=0xl2345678. Now, substitute the values 
into the equation for PhysAddr(31 downto 12): 

PhysAddr(31 downto 12)<-(0x02000 andnot(0x01FFF)) 
or (0x12345678 and OxOlFFF)) -or- 

PhysAddr(31 downto 12)<=(0x02000) or (0x00001678)= 
0x03678. Note that the offset 206 corresponds to the 
lower 12 bits of the virtual address 0x10002030 or 
AGPAddr(ll downto 0)-030. Thus, to obtain the 
physical address 206, the physical page 216 is com- 
bined with the offset 206 to form PhysAddr(31 downto 
0) or 0x03678030. To summarize, the pseudo-code of 
the embodiment described above illustrates the trans- 
lation of the virtual address 0x10002030 to the physical 
address 0x03678030. 

Moreover, the feature bits field 222 provides status infor- 
mation for use in virtual to physical address translations. In 
contrast to prior art systems, the feature bits of one embodi- 
ment of the present invention provide substantial design 
flexibility by enabling software to change the format of the 
GART table 210 (FIG. 5a) without the need for a costly 
redesign of the hardwired circuitry. For example, during an 
address translation, the system may need to verify that the 
physical address corresponding to the virtual address still 
includes valid data. Similarly, the system may need to 
determine if a referenced physical address has been read or 
written to. The contents of the feature bits field 222 provide 
this functionality. In one embodiment, the feature bits field 
222 includes indicators for PTE valid 226, page read 228 
and page write 230. These indicators 226, 228, 230 may be 
located anywhere within the feature bits field 222 and may 
be implemented using at least one bit. To implement an 
indicator, such as PTE valid 226, the present invention uses 
a mask register loaded during system boot up. Thus, for PTE 
valid 226, the ValidMask register is used to select the bit(s) 
to set in the feature bits field 222 to indicate a valid PTE. 
Similarly, for page read 228, the ReadMask register is used 
to select the bit(s) to set when a translated address has been 
read. Furthermore, for a page write 230, the WriteMask 
register is used to select the bit(s) to set when a translated 
address has been written to. For example, if ValidMask is 
zero, then no PTE Valid 226 indicator is defined. Otherwise, 
a PTE Valid 226 mask is defined and can be applied to a 
GART PTE 220 to determine if the physical address trans- 
lation is valid. The following VHDL pseudo-code imple- 
ments this logic: 

if ((ValidMask=0) or ((ValidMask and PTE)=ValidMask)) 
then 

PTEValid :=true; 
else 

PTEValid :-false; 
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end if; 

Similarly, to implement the page read 228 and page write 
230 indicators, a logical OR operation is performed on the 
GART PTE 220 using the WriteMask during write opera- 
tions and with the ReadMask during read operations. The 
resulting GART PTE 220 is then written to memory 218 
(FIG. Sa) to provide the page read 228 or page write 230 
status information. In a similar fashion, if the WriteMask or 
ReadMask is zero, then no page write 230 or page read 228 
indicator is defined and the GART PTE 220 is not written to 
memory. The following VHDL pseudo-code implements the 
page write 230 and page read 228 indicators: 

if ((WriteMask 0) and ((PTE and WriteMask) 
WriteMask)) 

then 

PTE :=PTE or WriteMask; 

UpdatePTE :=true; 
else 

PTE :-PTE; 

UpdatePTE :«false; 
end if; 

if ((ReadMask 0) and ((PTE and ReadMask) ReadMask)) 
then 

PTE :=PTE or ReadMask; 

UpdatePTE :-true; 
else 

PTE :=PTE; 

UpdatePTE :=false; 
end if; 

As discussed previously, the indicators 226, 228, 230 may be 
implemented by programming a mask register during system 
boot up. In one embodiment, the initialization BIOS pro- 
grams the mask register. In another embodiment, an oper- 
ating system API programs the mask register during system 
boot up. 

For example, suppose the following mask registers: 



ValidMask :=0x001 00000 —Position of Valid indicator in PTE 
WriteMask : =0x00200000 — Position of Write indicator in PTE 
ReadMask : =0x00400000 — Position of Read indicator in PTE 



are programmed during system boot up. To determine if the 
contents of a PTE 220 are valid, the if condition: 

((ValidMask-0) or ((ValidMask and PTE)-VaIidMask)) is 
evaluated to determine if it is true or false. Referring 
back to the previous example, note that PTE(2)= 
0x12345678. Now, substitute the values of PTE(2) and 
ValidMask into the if condition: 
((ValidMask-0) or ((ValidMask and PTE) -ValidMask)) 
-or- 

((0x00100000=0) or ((0x00100000 and 0x12345678)= 
0x00100000)) -or- 

((0x00100000=0) or (0x00 1 00000=0x00 1 00000)) 
to calculate a result. Here, the result is true indicating that 
the PTE is valid. Similarly, for a ValidMask set to 
0x01000000, evaluation of the if condition proceeds as 
follows: 

((ValidMask=0) or ((ValidMask and PTE)oValidMask)) 
-or- 

((0x01000000=0) or ((0x01000000 and 0x12345678)= 
0x01000000)) -or- 

((0x0100000-0) or (0x00000000-0x01000000)) 
to produce a false result as both (0x0100000 0) and 
(0x00000000 0x01000000), indicating an error reporting 



09/22/2003, EAST Version: 1.04.0000 



US 6,418,523 B2 

13 14 

and recovery procedure is needed. Moreover, for a Valid- the PTE, the GART base address 212 is combined at 213 

Mask set to 0x00000000 (i.e., valid bit disabled), the if with the contents of the virtual page number field 204 to 

condition always evaluates to true indicating that no errors obtain a PTE address 214. The PTE address 214 in turn 

are present. provides the physical page number 216 corresponding to the 

In a similar fashion, for a write transaction, the if condi- 5 virtual page number 204. However, at this point, a TLB 

tion: entry 242 is formed having a virtual page field 244, its 

((WriteMask 0) and ((PTE and WriteMask) WriteMask)) corresponding physical page field 246, a least recently used 

is evaluated. Note that the expression (WriteMask 0) indi- (LRU) counter 248 to determine the relative age of the TLB 

cates that the write mask is enabled and the expression entry 242 and a status indicator 250 to determine when the 

((PTE and WriteMask) WriteMask)) determines if the write io TLB 240 has valid information. The TLB entry 242 is stored 

bit of PTE(2) has already been set. Now, for PTE(2)= in a TLB 240 having a plurality of TLB entries 252. In one 

0x12345678 and WriteMask-Ox00200000, substitute these embodiment, there are a sufficient quantity of TLB entries 

values into the if condition* 252 to cover all of the translatable addresses in the entire 

((WriteMask 0) and ((PTE and WriteMask) WriteMask)) ^L™^ 184 , ( ! IG " 4 >: In J hi f S embodimen !> s yf tem l °g ic 

vv _ or _ 7 vv ' " 15 154 (FIG. 3) includes a block of registers to implement the 

~ " , , , rt TLB 240. In another embodiment, system logic 154 (FIG. 3) 

((0x00200000 0) and ((0x12345678 and 0x00200000) Mes a fct m portion ;J c h as cache SRAM, to 

0x00200000)) -or- implement the TLB 240. 

((0x00200000 0) and (0x00200000 0x00200000)) to pro- FIG. 6b illustrates the use of registers to provide direct 

duce a false result as 0x00200000=0x00200000. Thus, 20 rea d and write access to the TLB entries 252. In one 

the write bit does not need to be set. However, if embodiment, a TLB 240 operates as a memory cache for the 

PTE(2)=0xl21 45678, the if condition evaluates as fol- mos t recently used PTEs. In contrast, the interface of FIG. 

lows: 6b enables direct access of TLB entries 252 to reduce 

((WriteMask 0) and ((PTE and WriteMask) WriteMask)) latency and memory requirements. In this embodiment, 

-or- 25 control logic 232 receives a configuration bit from the 

((0x00200000 0) and ((0x12145678 and 0x00200000) processor 152 (FIG. 3) to disable the cache like operation of 

0x00200000)) -or- me TLB 240, thus enabling a direct access mode to the TLB 

((0x00200000 0) and (0x00000000 0x00200000)) 240 by software. In the direct access mode the 

to produce a true result as 0x00000000 0x00200000. Here, P r0CeSS °« P^ 0 ^ loa ^, a address uito Address 

the write bit for PTE(2) is set as the if condition indicates 30 ^ ster f 4 '. CoD l r °! to »? P r ° v ' d f es th< ! ^ ad f dre ^ ™ 

that a write has not occured on this page before and the Address Reg^ter 234 to the Mux 238 for selection of aTLB 

contents of PTE(2) are calculated as follows: referenced by the TLB address^In aread operation the 

imr dtc / • \/i TLB 240 returns the contents of the TLB entry referenced by 

PTE :«PTE or WriteMask; -or- (hc address to the Mux 33^ wn j cn m mrn passcs tne 

PTE(2):=0xl21 45678 or 0x00200000 -or- 35 contents of the TLB entry to the Data Register 236 for 
PTE(2):=0xl2345678 storage. The processor 152 (FIG. 3) then reads the Data 
and PTE(2)=0xl2345678 is written back to memory. Register 236 to obtain the contents of the desired TLB 
Lastly, for a read transaction, the if condition: address. In a write operation, the processor 152 (FIG. 3) 
((ReadMask 0) and ((PTE and ReadMask) ReadMask)) loads data to be written to the TLB 240 into the Data 
is evaluated. This pseudo-code operates in a substantially 40 Register 236. Control logic 232 provides the data in Data 
similar manner to the pseudo-code discussed above for the Register 236 to the Mux 238, which then passes the data to 
WriteMask. Note that the expression (ReadMask 0) indi- the TLB 240 for storage in the TLB entry referenced by the 
cates that the read mask is enabled and the expression ((PTE TLB address stored in Address Register 234. 
and ReadMask) ReadMask)) determines if the read bit of For example, suppose the processor 152 (FIG, 3) needs to 
PTE(2) has already been set. Now, for PTE(2>0xl2345678 45 update TLB(l) with the value 0x12345678 and verify stor- 
and ReadMask-0x00400000, substitute these values into the age of 0x12345678 in TLB(l). The processor 152 (FIG. 3) 
if condition: writes the TLB address corresponding to TLB(l) into the 
((ReadMask 0) and ((PTE and ReadMask) ReadMask)) Address Register 234 and the value 0x12345678 into Data 
_ or . Register 236. Control Logic 232 provides the contents of 
((0x00400000 0) and ((0x12345678 and 0x00400000) 50 * ddr f s ^ TLB(l) address, to the Mux 238 
f)x004f)nonnYk or f° r se l ecuotl of TLB(l). The Mux 238 then passes the value 
an na^rZfi m a /n nmnnnnn mmm™^ , 0x12345678 from Data Register 236 to the TLB 240 for 
((0x00400000 0) and i**™^ storage in TLB(l). To verify the write operation, the pro- 
tee a true result as 0x00000000-0x00400000 Thus ^ 1$2 (n ^ 3) now / xecutes a ^ commmd * M 
PTE(2 has not been read before and the value of 55 Address Register 23 4 still holds the TLB(l) address, control 
PTE(2) is calculated as follows: logic m ±t addfess ^ Mdrcs& Reg . 

PTE :-PTE or ReadMask; -or- 234 to the Mux 238 for selection of TLB(l). The TLB 

PTE(2):=0xl2345678 or 0x00400000 -or- 240 returns the contents of the TLB(l), 0x12345678, to the 

PTE(2):=0xl2745678 Mux 238, which in turn passes the value 0x12345678 to the 

and PTE(2)=0xl2745678 is written back to memory. 60 Data Register 236 for access by the processor 152 (FIG. 3). 

FIG. 6a illustrates the translation of a virtual address 200 In this manner, the embodiment of FIG. 6b provides a 

to a physical address 202 (FIG. 5a) using a translation look mechanism for indirect addressing, whereby individual TLB 

aside buffer (TLB) 240. As before, a virtual address 200 entries may be directly accessed. 

includes a virtual page number field 204 and an offset field FIG. 7 illustrates the operation of a TLB 240 to provide 

206. Translation of the virtual page number field 204 occurs 65 translation of a virtual address 200 to a physical address 202 

by finding a PTE of the GART table 210 corresponding to to retrieve the desired information from the main memory 

the contents of the virtual page number field 204. To identify 218. Tne TLB 240 comprises a plurality of TLB entries 252, 
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each entry having a virtual page field as described with 
reference lo FIG. 6a. To determine if a desired translation 
exists in the TLB 240, the contents of the virtual page 
number field 204 are compared at 253 to the contents of the 
virtual page fields of each of the plurality of TLB entries 252 5 
in the TLB 240. For example, the contents of the virtual page 
field 246 (FIG. 6a) of TLB entry 242 (FIG. 6a) are compared 
at 253 to the contents of the virtual page number field 204 
(FIG. 7) and no match is found. Upon finding a match, an 
index 254 corresponding to the matching TLB entry 255 is 10 
used to retrieve the contents of the matching TLB entry 255 
from the TLB 240. The contents of the physical page field 
256 of the matching TLB entry 255 are combined at 217 
with the contents of the offset field 206 of the virtual address 
200 to form the physical address 202, which references a 15 
location in main memory 218 holding the desired informa- 
tion. Note that a status indicator 262 of the matching TLB 
entry 255 indicates whether the contents of the physical page 
field 256 are valid and, if so, a LRU counter 260 is updated. 

Referring now to FIG. 8, a flowchart illustrates a method 20 
of using the present invention. At state 300, the system logic 
154 (FIG. 3) receives an AGP request for data referenced by 
a virtual address 200 (FIG. 6a). At state 302, the system 
logic 154 (FIG. 3) determines if the TLB 240 (FIG. 6a) has 
the requested virtual address 200 (FIG. 6a). If the requested 25 
virtual address 200 (FIG. 6a) is not in the TLB 240 (FIG. 
6a), the system logic 154 obtains the virtual to physical 
address translation from the GART table 210 (FIG. 6a) 
located in main memory 218 (FIG. 6a). At state 304, the PTE 
Address 214 (FIG. 6a) is generated by combining the GART 30 
base address 212 (FIG. 6a) with the contents of the virtual 
page number field 204 (FIG. 6a) of the virtual address 200 
(FIG. 6a). At state 306, the system logic 154 (FIG. 3) fetches 
a GART PTE 220 (FIG. 5b) corresponding to the PTE 
Address 214 (FIG. 6a) from the main memory 218 (FIG. 35 
6a). Upon retrieving the GART PTE 220 (FIG. Sb), the 
system moves to state 308 wherein a TLB entry slot 242 
(FIG. 6a) in the TLB 240 (FIG. 6a) is selected to store the 
physical translation information for the virtual address 200 
(FIG. 6a). The virtual to physical address translation pro- 40 
ceeds to state 310 as for the circumstance where the 
requested virtual address 200 (FIG. 6a) exists in the TLB 
240 (FIG. 6a). 

At state 310, the LRU counters 248 (FIG. 6a) of all TLB 
entries 252 (FIG. 6a) are updated to reflect the most recent 45 
access of the TLB 240 (FIG. 6a). At state 312, the physical 
address 202 (FIG. 7) corresponding to the virtual address 
200 (FIG. 7) is formed by combining the contents of the 
physical page field 256 (FIG. 7) with the offset 206 (FIG. 7) 
of the virtual address 200 (FIG. 7). At state 314, the System 50 
logic 154 (FIG. 3) then issues a memory request to retrieve 
the contents of the physical address 202 (FIG. 7). Lastly, the 
AGP request is completed at state 316. 

Referring now to FIG. 9, a flowchart illustrates one 
embodiment of a process for updating the LRU counters of 55 
all TLB entries 310 (FIG. 8). At state 320, the LRU counter 
for the selected TLB entry 242 (FIG. 6a) is saved for 
subsequent comparison to the LRU counters of each of the 
TLB entries. This comparison takes place at state 322. If the 
current TLB entry for comparison is determined to be the 60 
same as the selected TLB entry 242 (FIG. 6a) at state 324, 
the LRU counter of the selected TLB entry 242 (FIG. 6a) is 
set to the maximum value at state 326. Otherwise, the LRU 
counter of the TLB entry for comparison is decremented at 
state 328. In one embodiment, the LRU counter is decre- 65 
mented by one. Thus, when a TLB hit occurs, the LRU 
counter of the selected TLB entry 255 (FIG. 7) is loaded to 
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its maximum value and the LRU counters of all other TLB 
entries 252 (FIG. 7) are decremented. 

Referring now to FIG. 10, a flowchart illustrates one 
embodiment of a process for selecting a TLB slot 308 (FIG. 
8). At state 340, system logic 154 (FIG. 3) determines if a 
TLB slot is not currently used. If an unused TLB slot is 
found, this slot is selected at state 342 to store the physical 
translation information in the TLB entry 242 (FIG. 6a). 
Otherwise, the LRU counters of all TLB slots are compared 
at state 344. When the TLB entry having the minimum LRU 
counter value is found, this slot is selected at state 346 to 
store the physical translation information in the TLB entry 
242 (FIG. 6a). Lastly, at state 348, the status indicator 250 
(FIG. 6a) of the selected TLB slot is set to indicate a valid 
entry. 

Referring now to FIG. 11, a flowchart illustrates one 
embodiment of a process for fetching a GART PTE 306 
(FIG. 8). At state 360, the system logic 154 (FIG. 3) obtains 
the virtual page number 204 (FIG. 5a) from the virtual 
address 200 (FIG. 5a). At state 362, the virtual page number 
204 (FIG. 5a) is then combined with the GART base address 
212 (FIG. 5a) to form a PTE Address 214 (FIG. 5a). Lastly, 
at state 364, system logic 154 (FIG. 3) reads the PTE from 
main memory 218 (FIG. 5a) using the PTE Address 214 
(FIG. 5a). 

The present invention advantageously overcomes several 
limitations of existing technologies and alternatives. For 
example, current technologies store graphics data in expen- 
sive local frame buffer memory. In contrast, the present 
invention enables storing, addressing and retrieving graphics 
data from relatively inexpensive main memory without the 
bandwidth limitations of current system bus designs. 
Furthermore, by defining the GART in software, the present 
invention eliminates many hardware dependencies. Instead 
of expensive circuit redesigns and fabrication, the present 
invention enables inexpensive software modifications to 
address future partitioning and remapping circuitry as well 
as any current or future compatibility issues. Moreover, the 
present invention enables computer manufacturers to inves- 
tigate cost and performance compromises at the system 
integration stage rather than at the hardware design and 
development stage. For example, computer manufacturers 
may implement the entire GART in main memory (instead 
of registers) to reduce register costs, while caching an image 
of the most recently used GART entries in a few registers to 
reduce access times to main memory. The invention thus 
provides substantial flexibility to address ever changing cost 
and performance requirements well after the completion of 
the hardware design. In contrast to existing hardware design 
paradigms, the present invention enables rapid and inexpen- 
sive modifications to address evolving customer and market 
needs. 

In addition, the present invention is useful for computer 
system applications that flexibly allocate memory resources 
which are tightly coupled to the computer hardware. For 
example, the invention is useful in situations where hard- 
ware ascertains and reports state information, such as diag- 
nostic data or vital product data. The invention allows for 
flexible reporting of the state information under software 
control, instead of hardware control where functions are 
hardwired into circuitry. Similarly, the invention provides 
alternate mechanisms to access internal registers for diag- 
nostic purposes. Lastly, the invention provides a mechanism 
whereby status can be flexibly programmed into memory. 
Thus, the invention enables any device, such as a network 
device broadcasting bits in a serial stream, to flexibly 
indicate status information using a medium other than 
memory. 
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The invention may be embodied in other specific forms 
without departing from its spirit or essential characteristics. 
The described embodiment is to be considered in all respects 
only as illustrative and not restrictive and the scope of the 
invention is, therefore, indicated by the appended claims S 
rather than by the foregoing description. All changes which 
come within the meaning and range of equivalency of the 
claims are to be embraced with their scope. 

What is claimed is: 

1. An apparatus for graphic address remapping of a virtual 10 
address, comprising: 

a processor; 

an interface that is accessible by the processor; and 
a translation lookaside buffer (TLB) in communication 
with the interface, the TLB having at least one TLB 
entry including information which is used to translate 
the virtual address to a physical address; 
wherein the interface receives a portion of the virtual 
address and provides access to the TLB entry corre- 
sponding to the portion of the virtual address, wherein 
the TLB entry includes translation information from a 
graphics address remapping table that contains location 
information of a plurality of physical pages of memory 
that are used to store graphics data, wherein the pro- 
cessor is capable of modifying the contents of the TLB 
via the interface. 

2. The apparatus of claim 1, wherein the interface pro- 
vides read access to the TLB entry. 

3. The apparatus of claim 1, wherein the interface pro- 
vides write access to the TLB entry. 

4. The apparatus of claim 1, wherein the interface further 
comprises: 

a data register; 

an address register receiving a portion of the virtual 
address; and 

a multiplexer in communication with the address register, 
the TLB and the data register, wherein the multiplexer 
selects the TLB entry corresponding to the portion of 
the virtual address and provides access to the selected 
TLB entry using the data register. 

5. The apparatus of claim 1, wherein the portion of the 
virtual address comprises a virtual page number field. 

6. The apparatus of claim 1, wherein the at least one TLB 
entry further comprises a least recently used (LRU) counter. 

7. The apparatus of claim 1, wherein the at least one TLB 
entry further comprises a status indicator to indicate if the 
TLB entry is valid. 

8. The apparatus of claim 1, wherein the virtual address 
includes a virtual page number field and an offset field. 

9. An apparatus for graphic address remapping of a virtual 
address, comprising: 

a processor; 

an interface that is accessible by the processor, and 5S 
a translation lookaside buffer (TLB) in communication 
with the interface, the TLB having at least one TLB 
entry including information which is used to translate 
the virtual address to a physical address; 
wherein the interface receives a portion of the virtual 60 
address and provide access to the TLB entry corre- 
sponding to the portion of the virtual address, wherein 
the TLB entry includes translation information from a 
graphics address remapping table that contains location 
information of a plurality of physical pages that are 65 
used to store graphics data, wherein the TLB includes 
at least one TLB entry for each physical page of 
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memory that is managed by the graphics address 
remapping table. 

10. The apparatus of claim 9, wherein the interface 
provides read access to the TLB entry. 

11. The apparatus of claim 9, wherein the interface 
provides write access to the TLB entry. 

12. The apparatus of claim 9, wherein the interface further 
comprises: 

a data register; 

an address register receiving a portion of the virtual 
address; and 

a multiplexer in communication with the address register, 
the TLB and the data register, wherein the multiplexer 
selects the TLB entry corresponding to the portion of 
the virtual address and provides access to the selected 
TLB entry using the data register. 

13. The apparatus of claim 9, wherein the portion of the 
virtual address comprises a virtual page number field. 

14. The apparatus of claim 9, wherein the at least one TLB 
entry further comprises a least recently used (LRU) counter. 

15. The apparatus of claim 9, wherein the at least one TLB 
entry further comprises a status indicator to indicate if the 
TLB entry is valid. 

16. The apparatus of claim 9, wherein the virtual address 
includes a virtual page number field and an offset field. 

17. An apparatus for graphic address remapping of a 
virtual address, comprising: 

a processor; 

an interface that is accessible by the processor; and 
a translation lookaside buffer (TLB) in communication 
with the interface, the TLB having at least one TLB 
entry including information which is used to translate 
the virtual address to a physical address; 
wherein the interface receives a portion of the virtual 
address and provides access to the TLB entry corre- 
sponding to the portion of the virtual address, wherein 
the TLB entry includes translation information from a 
graphics address remapping table that contains location 
information of a plurality of physical pages that are 
used to store graphics data, wherein size of the graphics 
address remapping table is configurable by a program 
that is executing on the processor. 

18. An apparatus for graphic address remapping of a 
virtual address, comprising: 

a processor; 

an interface that is accessible by the processor; and 
a translation lookaside buffer (TLB) in communication 
with the interface, the TLB having at least one TLB 
entry including information which is used to translate 
the virtual address to a physical address; 
wherein the interface receives a portion of the virtual 
address and provides access to the TLB entry corre- 
sponding to the portion of the virtual address, wherein 
the TLB entry includes translation information from a 
graphics address remapping table that contains location 
information of a plurality of physical pages that are 
used to store graphics data, wherein size of the graphics 
address remapping table is configurable by a program 
that is executing on the processor, and wherein TLB 
includes at least one TLB entry for each physical page 
of memory that is managed by the graphics address 
remapping table. 
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[57] ABSTRACT 

A method and system for address translation mapping of 
logical partitions for address translation buffer entries in a 
data processing system is provided. The method comprises 
receiving a logical address for a memory reference to a 
selected logical partition of a plurality of logical partitions of 
a particular block of virtual memory, wherein the block of 
virtual memory is divided into the plurality of logical 
partitions, and wherein the logical address includes a plu- 
rality of logical partition selection bits selecting the selected 
logical partition from among the plurality of logical parti' 
tions. If the selected logical partition is valid in real memory, 
as indicated by a logical partition valid bit associated with 
the selected logical partition, a physical address for the 
memory reference in the selected logical partition is com- 
piled from an entry of an address translation buffer that is 
associated with the particular block of virtual memory, 
wherein the logical partition valid bit is one of a plurality of 
logical partitions valid bits contained in the entry associated 
with the particular block of virtual memory, the plurality of 
logical partition valid bits being associated with the plurality 
of logical partitions. Thereafter, the memory reference 
within the selected logical partition is retrieved at the 
compiled physical address. 

22 Claims, 2 Drawing Sheets 
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VIRTUAL MEMORY MAPPING METHOD 

AND SYSTEM FOR ADDRESS 
TRANSLATION MAPPING OF LOGICAL 
MEMORY PARTITIONS FOR BAT AND TLB 
ENTRIES IN A DATA PROCESSING SYSTEM 

CROSS-REFERENCE TO RELATED 
APPLICATION 

The present application is related to U.S. Pat. Application 
Ser. No. 08/571,066, entitled "Virtual Memory Mapping 
Method and System For Memory Management Of Pools Of 
Logical Partitions For BAT And TLB Entries In A Data 
Processing System", filed herewith by the Inventors hereof 
and assigned to the assignee herein. 

BACKGROUND OF THE INVENTION 

1. Technical Field 

The present invention relates in general to a virtual 
memory mapping system in a data processing system, and in 
particular to an improved virtual memory mapping system in 
a data processing system having cached address translation 
mapping of memory references. Still more particularly, the 
present invention relates to an improved virtual memory 
mapping system in a data processing system having fine 
granularity of cached address translation mapping of 
memory blocks. 

2. Description of the Related Art 

A computer system typically includes a processor coupled 
to a hierarchically staged storage system. The computer's 
hierarchy of storage devices comprises primary memory mat 
includes internal components such as the CPU registers, 
cache memory, and main memory, and secondary memory 
that includes any external storage devices such as disks or 
tapes. Main memory is typically a DRAM or a SRAM 
Computers often use an intermediate high-speed buffer 
called a cache memory that resides between the external 
devices and main memory or between main memory and the 
CPU. Cache memories speed up the apparent access times of 
the slower memories by holding the words that the CPU is 
most likely to access. The hardware can dynamically allo- 
cate parts of the memory within the hierarchy for addresses 
deemed most likely to be accessed soon. 

Most computers use a multilevel storage system that 
operates as a virtual memory. In such systems, most pro- 
grams are stored on an external device, such as a hard disk. 
In practice, the logical-address space of many computers is 
much larger that their physical-address space in main 
memory. For example, if a byte-addressed computer uses a 
32-bit address, its logical address space has 2 32 memory 
locations , which is four gigabytes (GB). The operating 
system loads the program into the main memory in parts or 
pages, as demanded for execution. By using virtual memory 
paging, the computer loads into main memory only those 
parts of a program that it currently needs for execution. The 
remainder of the program resides in external storage until 
needed. Thus, one of the biggest advantages of virtual 
memory is that because programs are stored on secondary 
storage devices, the size of a program that may be executed 
is limited not by the size of main memory but rather by the 
size of the computer's logical-address space. 

Because various blocks of memory may be stored 
throughout the memory hierarchy, a program's logical 
addresses to particular instructions or data may no longer 
correspond to the physical addresses for the particular block 
of memory containing those instructions or data. In a virtual 
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memory system, the operating system maintains special 
tables that keep track of where each part of the program 
resides in main memory and in external storage. The 
memory map between logical-address space and physical- 

5 address space is maintained in a page frame or block table 
having a plurality of entries, each table entry holding infor- 
mation about a specific page or block of memory. Thus, the 
CPU uses address translation mapping from the tables to 
translate the program's effective (or logical) addresses into 
their corresponding physical addresses. Most virtual 
memory systems keep these translation tables in main 
memory, and maintain a translation table base register that 
points to the translation table in memory. Depending upon 
the configuration, separate tables may exist for the block 
table and the page frame table. 

15 Virtual memory hardware divides logical addresses into 
two parts — the virtual block (page) number (the high-order 
bits), and the word offset (the low-order bits). The virtual 
block (page) number serves as an offset into the block (or 
page frame) table. Therefore, when the system loads a block 

20 (or page) into memory, it always places the block beginning 
at a block boundary. A typical table entry includes a validity 
bit, which indicates whether the block is in main memory, a 
dirty bit, which indicates whether the program has modified 
the block, protection bits, which indicate which users may 

25 access the page or block of memory and how, and the 
page-frame or real block number (i.e., the physical address) 
for the block of memory, if the block (or page) is in main 
memory. 

To minimize the amount of time required to translate a 

30 virtual address to a real address, virtual memory mapping 
generally uses address translation buffers to cache informa- 
tion for recently translated pages. Each entry of an address 
translation buffer holds a real block number and the same 
information contained in the translation table, including the 

35 validity bit for the block, one or more dirty bits, protection 
bits, and the virtual block number to provide the map of the 
virtual block number to the real block number. In paging 
virtual memory systems, mis address translation buffer is 
called a translation lookaside buffer (TLB). In addition, 

40 some virtual memory systems provide an additional address 
translation buffer called a block address translation (BAT) 
buffer, which includes an additional field indicating the size 
of the block of memory mapped by the entry to support 
variable sized blocks. 

45 In operation, whenever the CPU generates an effective 
address, it is sent to the TLB and the BAT, which produce 
the real page frame or BAT block number, if either buffer 
holds an entry for the referenced block of memory. If one of 
the address translation buffers has an entry for the reference, 

50 the effective address is translated into the physical address 
by concatenating the real block number held in the entry 
with the word offset of the effective address. If the TLB or 
BAT has no entry for the referenced block of memory, the 
hardware (or software) consults the translation table in main 

55 memory by using the virtual block number as an offset into 
the translation table. If the validity bit for the entry in the 
translation table indicates the block is in memory, the 
hardware copies the translation table entry and uses the real 
block number to access the memory into the TLB. 

6o Otherwise, the hardware initiates a trap called a page or BAT 
block fault, at which point the operating system intervenes 
to load the demanded block of memory into main memory 
and updates the translation table and address translation 
buffers. A block or page fault is an exception that instructs 

65 the operating system to load into main memory the 
requested or demanded block or page and to update (he 
memory map. 
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In many systems, the page size is 4 kilobytes (4 KB) and • Variable sized blocks, as are used in BATs, require more 

there are 256 or fewer TLB entries. Consequently, the advanced memory management techniques from the oper- 

maximum amount of real storage covered by cached page ating system than do uniform sized pages. When a BAT 

translation informatioD is often 1 MB or less. Consequently, fault occurs, the operating system must rind (or create) an 

it is unlikely that TLBs far 4 KB pages will ever cover 5 available block of contiguous real storage for the entire 

significant portions of large (5 12 MB-4 GB) main memories. block, a process that may require the removal of some 

These TLBs are even inadequate to prevent significant blocks already in memory. 

performance degradation due to TLB misses while accessing ^ Placing differing sized blocks in memory can leave 

data which fits in a large (4 MB-16 MB) Level Two (L2) fragments or splinters, i.c, small blocks of memory 

cache. As real memory capacities, program footprints, and 1Q between other allocated blocks that are too small to be 

user working sets continue to grow, it is beneficial to used by other blocks, creating inefficiency, 

increase the amount of real memory covered by cached ft can be seen that there is a need for a virtual memory 

translation information. Three common approaches to ma ^ n syst£m mat provides translation coverage for large 

mcreasingme coverage are 1) mcreasmg Ae niim^ of bloc^T of storage, while still providing sufficiently fine 

entries, 2) supporting larger pages, and 3) adding BAT ^ t0 ^ input/output (I/O) efficiency and 

facilities to augment the TLBs^ reduce memory allocation problems. Such a virtual memory 

Increasing me number of TLB entoes mapping system would dramatically increase the coverage 
both m terms of chip area ^^.^^^^ ofclhedtoslation information while avoiding the prdb- 
S*ur^ lernsass^with^ 
the L2 cache but large portions of main memory; however, 20 SUMMARY OF THE INVENTION 
me larger granularity of storage blocks and memory map- 
ping creates great inefficiencies. According to the present invention, a method and system 

BAT facilities are an alternative, and more common way for address translation mapping of logical partitions for 
of providing many of the coverage benefits of large pages. address translation buffer entries in a data processing system 
BAT registers, which are set by the operating system, in 25 [ s provided The method comprises receiving a logical 
contrast to TLB entries, which are typically reloaded by address for a memory reference to a selected logical parti- 
hardware, specify the translation for a block of storage tton of a plurality of logical partitions of a particular block 
which is large relative to a 4 KB page. Unlike TLB entries Q f virtual memory, wherein the block of virtual memory is 
which typically translate a fixed-size block (a page) of divided into the plurality of logical partitions, and wherein 
storage, bits in the BAT registers allow a specification of a 30 th e logical address includes a plurality of logical partition 
range roughly equivalent to 50-5000 (4 KB) pages. Both selection bits selecting the selected logical partition from 
large pages and BAT blocks must be on a boundary similar among the plurality of logical partitions. If the selected 
to the size of the page or block in both address spaces (Le., logical partition is valid in real memory, as indicated by a 
a 2 MB block must be on a 2 MB boundary). While large logical partition valid bit associated with the selected logical 
pages or large BAT blocks can alleviate the pressure for 35 partition, a physical address for the memory reference in the 
more TLB entries, the larger blocks of memory create selected logical partition is compiled from an entry of an 
operating system complexities and granularity issues that address translation buffer that is associated with the particu- 
create great disadvantages for the memory mapping system. lar block of virtual memory, wherein the logical partition 
The major disadvantages of such large blocks are: valid bit is one of a plurality of logical partitions valid bits 

• When a BAT fault (ie., a "page fault" for a portion of 40 contained in the entry associated with the particular block of 
memory which will be covered by a BAT entry) is virtual memory, the plurality of logical partition valid bits 
encountered, the faulting process is suspended until all being associated with the plurality of logical partitions, 
data for the large block has been brought from disk to Thereafter, the memory reference within the selected logical 
memory and the translation is made valid for the block. partition is retrieved at the compiled physical address. 
Longer waits are associated with larger blocks. 45 .„^ T „ 

• While an entry is valid, all data must be present Even if BRIEF DESCRIPTION OF THE DRAWINGS 
only a small portion of the data (or instructions) is needed, ^ nove] ^^55 believed characteristic of the invention 
the entire block of main storage is allocated (and unavail- afC ^ forth ^ ^ appended claims. However, the invention, 
able for other uses). Hardware complexity of BATs results as wcll ^ a preferred mode of use, and further objects and 
in implementations with only a few BAT entries. In 50 ^y^^ges thereof, will best be understood by reference to 
multi-user .systems with hundreds or thousands of pro- ^ ^(Hvuig detailed description of an illustrative embodi- 
cesses and where memory is a limited resource, operating ment ^ m conjunction with the accompanying 
systems are reluctant to allocate large contiguous blocks drawings, wherein: 

of real memory to each user process. Therefore, it is memory mapping system of a 
expected mat operatmg systems ^»«*^J*™ 55 JSi S^ntTtJe preset Sol; and 

™* * flBed) U>beUSedf0r FIG. 2 shows a block diagram of an Address Translation 

• ^difficult to provide coverage for areas which are not Buffer, in accordance with a preferred embodiment of the 
a power-of-two bytes because blocks must be stored on a present invention. 

power-of-two boundary. For example, to cover a 7 MB 60 DETAILED DESCRIPTION OF THE 

block, there are two choices: 1) Use an 8 MB entry and PREFERRED EMBODIMENT 

waste 1 MB of real storage. (This may not be permissible t 

since the user can access storage beyond the expected 7 With reference now to the figures, and in particular with 

MB limit). 2) Use three BAT entries (4 MB+2 MB+1 reference to FIG. 1, there is depicted the virtual memory 
MB) This is a serious burden for current BAT hardware 65 mapping system of a preferred embodiment of the present 

as most implementations have only a few (2-8) BAT invention. CPU 10 executes programs stored ^ logical 

entries address space of direct access storage device (DASD) 22. 
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When an instruction executed in CPU 10 generates^ a with a faster access time to hold the memory references most 

memory reference to an instruction (or data) not contained recently used As will be appreciated. Address Translation 

within the CPU registers 13, the memory reference must be Buffer 16 would make a request for the translated memory 

retrieved from the data processing system's memory hier- access to the cache memory, so that if the cache holds a copy 

archy. s of the requested data, the cache will quickly process the 

The instruction's logical address 12 addresses a block of request Otherwise, the cache forwards the request to main 

memory within the logical address space of the virtual memory 20. 

memory system. The block of memory addressed by the Referring now to FIG. 2, there is shown a block diagram 
logical address is partitioned into a plurality of logical of Address Translation Buffer 16, in accordance with a 
partitions. An effective address is computed from the logical 10 preferred embodiment of the present invention. Address 
address by the effective-address computation hardware 14. Translation Buffer 16 has eight BAT registers. Each BAT 
During program execution, the effective-address computa- register contains a BAT entry. As will be appreciated by 
□on hardware 14 converts the instruction's address specifi- those skilled in the art, the present invention is also appli- 
cations into effective addresses. These are the addresses that cable to large page TLB entries having partitioned pages and 
the CPU uses when referencing an instruction or variable, 15 the operation of such an embodiment would be substantially 
and in general they agree with the compiler* s logical address similar to the present embodiment, except for the variable 
(as will be assumed herein). sized blocks for the BAT entries, and so such a description 

CPU 10 sends the effective address to Address Translation is not repeated herein. FIG. 2 shows a single ATB register 24 
Buffer (ATB) 16. Address Translation Buffer 16 contains and associated hardware 26-36. TLB 15 and the remaining 
TLB 15 and BAT 17. Both the TLB and the BAT are 20 seven registers and associated hardware of Address 
searched for an entry for that effective address. According to Translation Buffer 16 arc not shown. As will be appreciated, 
the present invention, a memory reference for a particular Address Translation Buffer 16 can be equipped to provide 
instruction may be to a selected logical partition of the block address translation caching for any number of entries, 
of virtual memory addressed by the logical address. A including eight, for a preferred embodiment 
logical address includes a virtual block number addressing 25 As shown in FIG. 2, each BAT entry of Address Trans- 
the block of memory in virtual memory space, a plurality of lation Buffer 16 contains a block Valid bit, block Size, 
logical partition selection bits selecting a partition of the effective address (TAG), real address (RA), and a plurality 
block of virtual memory, and a real address offset In the of partition valid tats (Partition Valids). The block Valid bit 
present embodiment only block entries of BAT 17 may be specifies whether the contents of the BAT register represent 
logically partitioned, but it will be understood by those 30 a valid address translation for the referenced block of 
skilled in the art that the present invention is applicable to memory. If the block of memory referenced by the TAG in 
any ATB entry, such as large page TLB entries. If Address the entry is stored in main memory at the real address (RA), 
Translation Buffer 16 has a valid entry for that effective the block valid bit is set. The block Size is an encoding of 
address, it generates the corresponding physical address for the size, in bytes, of the block of memory translated by the 
the memory reference, which indicates the location of the 35 current entry. While the implemented TAG and RA fields 
memory reference within main memory 20. The physical need a sufficient amount of bits to handle the smallest 
address is transferred from Address Translation Buffer 16 to allowable block, the number of bits used for a given trans- 
main memory 20, thereby accessing the memory reference lation is determined by the block size. For a block size of 2* 
in the physical address space and loading it from main bytes, the TAG field contains the address of the first byte of 
memory 20 to CPU registers 13. 40 mc block in the effective/virtual address space (i.e., the 

If Address Translation Buffer 16 has no entry for the virtual block number) for the translated block of memory, 

effective address, the hardware (or software) consults the with the low-order N address bits removed. The RA field 

block (or page frame) table 18, using the virtual block contains the address of the first byte of the block in the real 

number as an offset into the block table. (Although shown address space of the main memory (i.e., the real block 

separately, block table 18 typically resides within main 45 number), with the low-order N address bits removed, 

memory 20.) Block table 18 contains many more entries Although the Address Translation Buffer supports multiple- 

than Address Translation Buffer 16, but provides a much size blocks, in the preferred embodiment, a 1 MB BAT block 

slower access time because it is contained in the slower main size is assumed 

memory 20 and has a larger number of entries to search In a preferred emtodiment the ATB entry has eight 

when making a comparison with the effective address. If the so partition valid bits, each corresponding to one of eight 

validity bit for the entry in the block table 18 corresponding separate partitions of the 1 MB block of virtual memory 

to the virtual block number indicates the block is in memory, translated by the particular ATB entry. Each partition bit is 

the hardware copies the block table entry into a new entry in set when its corresponding partition of the block of memory 

the Address Translation Buffer 16 and uses the real is stored in the main memory. In accordance with the present 

(physical) block number in the block table entry to access 55 invention, a memory reference to a particular partition of a 

the addressed data in main memory 20. Otherwise, the block of virtual memory will only result in the particular 

search of the block table for a memory reference to a block partition of the block of memory being stored to main 

of memory that is not present in main memory 20 results in memory. If a valid partition is stored in main memory, its 

a block (or page) fault A block fault is an exception that corresponding partition valid bit in the partition Valids field 

instructs the operating system to load into main memory the 60 of the ATB entry is set If that partition becomes invalid or 

requested or demanded block and to update the block map is overwritten, the corresponding partition valid bit Is reset 

(Le M create an entry in the block table 18 and address As seen in FIG. 2, an untranslated effective address 

translation buffer 16). The retrieved data is then loaded from EAO-31 is presented to ATB 16. This untranslated effective 

main memory 20 to CPU 10 to satisfy the memory access. address is simultaneously presented to all (appropriate) ATB 

As will be appreciated, the memory hierarchy shown in 63 entries in the ATB 16. In this embodiment, the effective and 

FIG. 1 may also include a cache memory between main real addresses for a block of memory are 32-bit addresses, 

memory 20 and CPU 10 that provides a high-speed RAM The upper twelve bits (EA 0-11), which specify a 1 MB 
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w~* nfth,- effective address are compared by comparators created within Address Translation Buffer 16 to satisfy a 

Mock c^responding to this ATB entry. If these addresses backed by real storage in the num meraary a«daj ^ 
mahl^Kffis output to ANDgate 28, which forms entry is created wtt oriy the o« ^"U c^esp^dmg to 

address bits RAIMI rnainmemory and the partition's valid bit would be setinthe 

Simultaneously, a valid bit from the ei^t p^tioa valia ™ For additional perfomiance, the operating system 

Hts (Partition Va^^ 10 £J «S S eXahead^oiial partial of the block 

memory referenced by the effe^ve^addressis vahd in fce ^^^J^ or after the immediate memory refer- 

main memory. The high-order effective address bits within ^ * 

the block (U.> the high-order bits addressing a particular Mwmbt appreciated by those skilled in the art, signifi- 

partition within the block of memory corresponding to the ^ adyanta ^ m created by the present invention's 

ATB entry) are used to select the partition and its valid bit 15 method and of introducing logical partitions for a 

within the partition valid bits. Within the 1 MB block / ^ cnfry mc foga^ partitioning of the address 

addressed by the first twelve effective address bits, the translation mapping can be provided in either existing BAT 

high-order effective address bits EA 12-14 select a particu- OT technologies. Additionally, the response time of I/O 

lar partition of the eigfrt partitions of the block. The eight for a fautt ^ decreased because only a subset, a single 

partition valid bits are input into multiplexer 30 and the ^ partitioIL of the block is required. The faulting process may 

high-order effective address bits EA 12-14 are used to index thus resume after a portion of the data (or instructions) has 

the selected partition valid bit as an output from multiplexer been retrieved, rather than waiting for the entire block 

30 The output of multiplexer 30 is input into ANDgate 28 covered by the ATB entry. Further, the amount of contiguous 

as 'another input into the last stage of the tag comparison real storage required to immediately satisfy a ATB fault is 

function, as wfll be understood by those skilled in the ait M reduced to the size of the pardon. Au^alf* toe real 

to« has been a match of the virtual block number for storage associated with non-vaHd lotions can be made 

th TrfcSveTlS Ss and the virtual block number available to other processes until needed by ^ process, 

tne rcceivea logical aa^csb partition the other portions of the block are not accessed during some 

^lo bS^ved^S l£J*££T2 SLrval. agonal savings may result by eliminating unne c- 

selection bits in tne received ejected * essarv I/O. Still further, because the granularity of real 

S°on £ inte JSS, V£5* 5eS^S32 Si win* must for a particular memory 

SK2£2 y^httS^rSstcrs arc connected access is reduced, address transUt.cn caching of user 

set, locating an al b djk./vu ^ address space is more acceptable to the system. Last, the 

nraduces an ATB hit signal, an ATB fault occurs. reserved for mis ATB entry. 

t^ b^S^y those skilled in the at. the Although the present invention has been described to 
Ccal^o ^^S 1 within the effective address 45 tarns of ^J^^^*^™^! 

block of rnemory and die bio* «ub. gjK tttoejt applicable to other sizes, such as two or four 

SraHr5sS£=s »r:r sl-zsl*! sat 

SLTarec^veiSe^Es Ta fuZ of the block features and advantages offce present ^enh°n re^s 
mis or a receiveu cu«uve V. ^ ntro i n f niulti- that the operating system perform the manipulations of tne 
^t^^^T^^Z- » Seal p£u» £U ™ ^ £££ 

STu^notoS meRA fields be understood by those skUled in the art thai variouschanges 

an^ATO nTtTnal dnce the Andgate 28 can be «o in form and detail may be made therein without departing 
Sle^as oUSnooTbit in the^mparators 26. from me sp*t and scope of the invent™. 

As can be seen, the present Invention provides a method T^^SSti translation mapping of logical 

and system for Refining ^V*^**^^* ao^S^slation buffcTe^iJ in a data 

memory by introducing logical partitions of the block and P™"? me comprising the steps of : 

Serial, an endreATO entry If a new ATB entry must be sel^d logical partition among a pluralrty of logical 
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partitions of a particular block of virtual memory, 
wherein the block of virtual memory is divided into the 
plurality of logical partitions, and wherein the logical 
address includes a plurality of logical partition selec- 
tion bits selecting the selected logical partition from 
among the plurality of logical partitions; 

if the selected logical partition is valid in real memory, as 
indicated by a logical partition valid bit associated with 
the selected logical partition, compiling a physical 
address for the memory reference in the selected logical 
partition from an entry of an address translation buffer 
that is associated with the particular block of virtual 
memory regardless of whether logical partitions of said 
particular block other than said selected logical parti- 
tion are valid within said real memory, wherein the 
logical partition valid bit is one of a plurality of logical 
partitions valid bits contained in the entry associated 
with the particular block of virtual memory, the plu- 
rality of logical partition valid bits each being associ- 
ated with a respective one of the plurality of logical 
partitions; and 

retrieving the memory reference within the selected logi- 
cal partition at the compiled physical address. 

2. A method for address translation mapping of logical 
partitions far address translation buffer entries in a data 
processing system according to claim 1, wherein a logical 
address includes a virtual block number for the particular 
block of virtual memory, and wherein each entry includes a 
virtual block number for an associated block of memory and 
a real block number addressing the associated block in real 
memory, and wherein an entry is associated with a particular 
block of virtual memory when the virtual block number for 
the particular block of virtual memory and the virtual block 
number for the entry are equal 

3. A method for address translation mapping of logical 
partitions for address translation buffer entries in a data 
processing system according to claim 1, wherein the step of 
compiling comprises concatenating the real block number 
with the logical partition selection bits and a real address 
offset to form a physical address. 

4. A method for address translation mapping of logical 
partitions for address translation buffer entries in a data 
processing system according to claim 1, further comprising 
the steps of: 

if said plurality of logical partition valid bits indicate that 
none of said plurality of logical partitions of said 
particular block of virtual memory is valid within said 
real memory, storing into the real memory the selected 
logical partition for the received logical address, and 
creating an entry in the address translation buffer for the 
particular block of memory addressed by the received 
logical address. 

5. A method for address translation mapping of logical 
partitions for address translation buffer entries in a data 
processing system according to claim 1, further comprising 
the steps of: 

If said plurality of logical partition valid bits indicate that 
at least one logical partition of said particular block 
other than said selected logical partition is valid within 
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processing system according to claim 4, and further com- 
prising the step of accessing said selected logical partition in 
the real memory prior to storing, into said real memory, 
another of said plurality of logical partitions in the block of 
memory addressed by the received logical address. 

7. A system for address translation mapping of logical 
partitions for address translation buffer entries in a data 
processing system, said system comprising: 

means for receiving a logical address for a memory 
reference to a selected logical partition among a plu- 
rality of logical partitions of a particular block of 
virtual memory, wherein the block of virtual memory is 
divided into the plurality of logical partitions, and 
wherein the logical address includes a plurality of 
logical partition selection bits selecting the selected 
logical partition from among the plurality of logical 
partitions; 

means, responsive to the selected logical partition being 
valid in real memory, as indicated by a logical partition 
valid bit associated with the selected logical partition, 
for compiling a physical address for the memory ref- 
erence in the selected logical partition from an entry of 
an address translation buffer that is associated with the 
particular block of virtual memory regardless of 
whether logical partitions of said particular block other 
than said selected logical partition are valid within said 
real memory, wherein the logical partition valid bit is 
one of a plurality of logical partitions valid bits con- 
tained in the entry associated with the particular block 
of virtual memory mat are each associated with a 
respective one of the plurality of logical partitions; and 

means for retrieving the memory reference within the 
selected logical partition at the compiled physical 
address. 

8. A system for address translation mapping of logical 
partitions for address translation buffer entries in a data 
processing system according to claim 7, wherein a logical 
address includes a virtual block number for the particular 
block of virtual memory, and wherein each entry includes a 
virtual block number for an associated block of memory and 
a real block number addressing the associated block in real 
memory, and wherein an entry is associated with a particular 
block of virtual memory when the virtual block number for 
the particular block of virtual memory and the virtual block 
number for the entry are equaL 

9. A system for address translation mapping of logical 
partitions for address translation buffer entries in a data 
processing system according to claim 7, wherein the means 
for compiling comprises means for concatenating the real 4 
block number with the logical partition selection bits and a 
real address offset to form a physical address. 

10. A system for address translation mapping of logical 
partitions far address translation buffer entries in a data 
processing system according to claim 7, further comprising 
means, responsive to said plurality of logical partition valid 
bits indicating that none of said plurality of logical partition 
of said particular block is valid within the real memory, for 
storing into the real memory the selected logical partition for 
the received logical address, and creating an entry in the 



the real memory, storing into the real memory the 60 address translation buffer for the block of memory addressed 



selected logical partition addressed by the received 
logical address, and updating the plurality of logical 
partition valid bits in the entry corresponding to the 
particular block of memory to indicate that the selected 
logical partition is contained in the real memory. 
6. A method for address translation mapping of logical 
partitions for address translation buffer entries in a data 



65 



by the received logical address. 

1L A system for address translation mapping of logical 
partitions for address translation buffer entries in a data 
processing system according to claim 7, further comprising 
means, responsive to said .plurality of logical partition valid 
bits indicating that at least one logical partition of said 
particular block other than said selected logical partition is 
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valid within the real memory, for storing into the real 
memory the selected logical partition addressed by the 
received logical address, and updating the plurality of logi- 
cal partition valid bits in the entry corresponding to the block 
of memory addressed by the received logical address to 
indicate that the selected logical partition is contained in the 
real memory. 

1Z A system for address translation mapping of logical 
partitions for address translation buffer entries in a data 
processing system according to claim 7, said system further 
comprising means far accessing said selected logical parti- 
tion in the real memory prior to storing, into said real 
memory, another of said plurality of logical partitions in the 
block of memory addressed by the received logical address. 

13. A method for address translation mapping of logical 
partitions for address translation buffer entries in a data 
processing system, the method comprising the steps of: 

receiving a logical address for a memory reference to a 
selected logical partition of a particular block of virtual 

Z.i i:- u.- n .^ m .1of KWIr nf virtual 
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processing system according to claim 13, when the selected 
logical partition is not indicated as valid in an entry having 
a match with the virtual block number of the received logical 
address, further comprising the steps of storing into the real 
memory the block of memory addressed by the received 
logical address, and updating the plurality of logical parti- 
tion valid bits in the entry corresponding to the block of 
memory addressed by the received logical address to indi- 
cate that the selected logical partition is co ntain ed in the real 
memory. 

17. A method for address translation mapping of logical 
partitions for address translation buffer entries in a data 
processing system according to claim 13, wherein the step of 
combining is performed by concatenating the real block 
number with the plurality of logical partition selection bits 
and a real address offset to form a physical address for the 
selected partition of the memory reference. 

18. A virtual memory mapping system for translating a 
logical address for a memory reference to a block of virtual 



memory, 



igicai partition o* * paruvuw** uw»-». »-»» ▼ ^ ouu»w<? *v» » -~ — — - — 

wherein the particular block of virtual 20 memory into a physical address for the memory reference in 

* 1 r 1 1 n »k<H.*in tti» hlrw>1r rtf virtual mMnnrv is divided 
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memory is divided into a plurality of logical partitions, 
and wherein a logical address includes a virtual block 
number addressing the block of memory in virtual 
memory space, a plurality of logical partition selection 
bits selecting the selected logical partition within the 25 
block of memory, and a real address offset; 
detenmning if there is a match between the virtual block 
number of the received logical address and a virtual 
block number for any entry in an address translation 
buffer, wherein each entry in the address translation 
buffer is associated with a block of virtual memory and 
includes a virtual block number, a real block number 
addressing the block of memory in real memory, and a 
plurality of logical partition valid bits, wherein a logi- 
cal partition of the block of virtual memory is indicated 35 
as valid in real memory by an associated logical 
partition valid bit [oflamong the plurality of logical 
partition valid bits; and 
if there is a match of the virtual block numbers for the 
received logical address and an entry in the address 40 
translation buffer and if the selected logical partition is 
indicated as valid by the associated logical partition 
valid bit, combining the real block number with the 
logical partition selection bits and the real address 
offset to form a physical address in real memory for the 45 
memory reference in the selected logical partition, 
regardless of whether logical partitions of said particu- 
lar block other than said selected logical partition are 
valid within said real memory. 

14. A method for address translation mapping of logical 
partitions for address translation buffer entries in a data 
processing system according to claim 13, further comprising 
the step of retrieving the memory reference in the selected 
logical partition addressed by the formed physical address, 

15. A method for address translation mapping of logical 
partitions for address translation buffer entries in a data 
processing system according to claim 13, if there is not a 
match of the virtual block number of the received logical 
address and the virtual block number of any entry in the 
address translation buffer, further comprising the steps of 
storing into the real memory the selected logical partition for 
the received logical address, and creating an entry in the 
address translation buffer for the block of memory addressed 
by the received logical address with the associated partition 
valid bit being set , 

16. A method for address translation mapping of logical 
partitions for address translation buffer entries in a data 
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real memory, wherein the block of virtual memory is divided 
into a plurality of logical partitions, and wherein a logical 
address includes a virtual block number addressing the block 
of memory in virtual memory space, a plurality of logical 
partition selection bits selecting the selected logical partition 
within the block of memory, and a real address offset, the 
virtual memory mapping system comprising: 
an address translation buffer containing a plurality of 
entries, wherein each entry in the address translation 
buffer is associated with a block of virtual memory and 
includes a virtual block number, a real block number 
addressing the block of memory in real memory, and a 
plurality of logical partition valid bits, wherein a logi- 
cal partition of the block of virtual memory is indicated 
as valid in real memory by an associated logical 
partition valid bit among the plurality of logical parti- 
tion valid bits; 
means far combining the real block number with the 
logical partition selection bits and the real address 
offset to form a physical address in real memory for the 
memory reference in the selected logical partition, 
when the selected logical partition is indicated as valid 
by the associated logical partition valid bit; 
a comparator that receives a logical address for a memory 
reference to a selected logical partition of a block of 
virtual memory and determines if mere is a match 
between the virtual block number of the received 
logical address and a virtual block number for any entry 
in the address translation buffer; 
selection means for selecting the logical partition valid bit 

associated with the selected logical partition; and 
a physical address register that receives and stores the real 
block number, the logical partition selection bits and 
the real address offset as a physical address to the real 
memory for the memory reference, when the logical 
partition valid bit associated with the selected logical 
partition indicates the selected partition is valid in real 
memory regardless of whether logical partitions of said 
particular block other than said selected logical parti- 
tion are valid within said real memory. 

19. A virtual memory mapping system according to claim 
18, further comprising a main memory that outputs the 
memory reference in the selected logical partition addressed 
by the formed physical address, 

20. A virtual memory mapping system according to claim 
18, further comprising a main memory capable of storing the 
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selected partition of the block of memory addressed by the 
received logical address, and wherein the address translation 
buffer creates an entry far the block of memory addressed by 
the received logical address, when the comparator deter- 
mines there is not a match of the virtual block number of the 
received logical address and the virtual block number of any 
entry in the address translation buffer. 

21. A virtual memory mapping system according to claim 
18. further comprising a main memory capable of storing the 



address to indicate that the selected logical partition of the 
block of memory is contained in real memory, when the 
selected logical partition is not indicated as valid in an entry 
having a match with the virtual block number of the received 
logical address. 

22 A virtual memory mapping system according to claim 
18. wherein the physical address register concatenates the 
real block number with the plurality of logical partition 



block of memory addressed by the received logical address, 10 selection bits and a real address offset to form a physical 
and wherein the address translation buffer updates the plu- address for the selected partition of the memory reference, 
rality of logical partition valid bits in the entry correspond- 
ing to the block of memory addressed by the received logical ***** 
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APPARATUS FOR SUPPORTING A LOGICALLY 
PARTITIONED COMPUTER SYSTEM 

CROSS REFERENCE TO RELATED 
APPLICATIONS 

[0001] This is a divisional application of U.S. patent 
application Ser. No. 09/346,206, filed Jail. 1, 1999, originally 
entitled "APPARATUS FOR SUPPORTING A LOGI- 
CALLY PARTITIONED COMPUTER SYSTEM", and by 
subsequent amendment entitled "GENERATING PARTI- 
TION CORRESPONDING REAL ADDRESS IN PARTI- 
TIONED MODE SUPPORTING SYSTEM", which is 
herein incorporated by reference. 

[0002] The present application is also related to the fol- 
lowing commonly assigned U.S. patents and patent appli- 
cations, all of which are herein incorporated by reference: 

[0003] Ser. No. 09/314,769, filed May 19, 1999, entitled 
Processor Reset Generated Via Memory Access Interrupt 
(Assignee's docket no. R0999-022). 

[0004] Ser. No. 09/314,541, filed May 19, 1999, entitled 
Apparatus and Method for Specifying Maximum Interactive 
Performance in a Logical Partition of a Computer (Assign- 
ee's docket no. R0999-021). 

[0005] Ser. No. 09/314,324, filed May 19, 1999, entitled 
Management of a Concurrent Use License in a Logically 
Partitioned Computer (Assignee's docket no. R09 99-023). 

[0006] Ser. No. 09/314,214, filed May 19, 1999, entitled 
Logical Partition Manager and Method (Assignee's docket 
no. R0999-025). 

[0007] U.S. Pat. No. 6,279,046 to Armstrong et al., 
entitled Event-Driven Communications Interface for Logi- 
cally-Partitioned Computer. 

[0008] U.S. Pat. No. 6,161,166 to Doing et al., entitled 
Instruction Cache for Multithreaded Processor. 

[0009] U.S. Pat. No. 6,263,404 to Borkenhagen et al., 
entitled Accessing Datafrom a Multiple Entry Fully Asso- 
ciative Cache Buffer in a Multithread Data Processing 
System. 

[0010] U.S. Pat. No. 6,021,481 to Eickemeyer et al., 
entitled Effective-To-RealAddress Cache Managing Appa- 
ratus and Method. 

[0011] U.S. Pat. No. 6,212,544 to Borkenhagen et al., 
entitled Altering Thread Priorities in a Multithreaded Pro- 
cessor. 

[0012] Ser. No. 08/958,716, filed Oct. 23, 1997, entitled 
Method and Apparatus for Selecting Thread Switch Events 
in a Multithreaded Processor (Assignee's docket no. R0997- 
104). 

[0013] Ser. No. 08/957,002, filed Oct. 23, 1997, entitled 
Thread Switch Control in a Multithreaded Processor System 
(Assignee's docket no. R0996-042). 

[0014] U.S. Pat. No. 6,105,051 to Borkenhagen et al., 
entitled An Apparatus and Method to Guarantee Forward 
Progress in a Multithreaded Processor. 

[0015] U.S. Pat. No. 6,076,157 to Borkenhagen et al., 
entitled Method and Apparatus To Force a Thread Switch in 
a Multithreaded Processor. 



[0016] U.S. Pat. No. 6,088,788 to Borkenhagen et al., 
entitled Background Completion of Instruction and Associ- 
ated Fetch Request in a Multithread Processor. 

FIELD OF THE INVENTION 

[0017] The present invention relates generally to digital 
data processing, and more particularly to support within a 
processing unit for logically partitioning of a digital com- 
puter system. 

BACKGROUND OF THE INVENTION 

[0018] A modem computer system typically comprises a 
central processing unit (CPU) and supporting hardware 
necessary to store, retrieve and transfer information, such as 
communications busses and memory. It also includes hard- 
ware necessary to communicate with the outside world, such 
as input/output controllers or storage controllers, and 
devices attached thereto such as keyboards, monitors, tape 
drives, disk drives, communication lines coupled to a net- 
work, etc. The CPU is the heart of the system. It executes the 
instructions which comprise a computer program and directs 
the operation of the other system components. 

[0019] From the standpoint of the computer's hardware, 
most systems operate in fundamentally the same manner. 
Processors are capable of performing a limited set of very 
simple operations, such as arithmetic, logical comparisons, 
and movement of data from one location to another. But 
each operation is performed very quickly. Programs which 
direct a computer to perform massive numbers of these 
simple operations give the illusion that the computer is 
doing something sophisticated. What is perceived by the 
user as a new or improved capability of a computer system 
is made possible by performing essentially the same set of 
very simple operations, but doing it much faster. Therefore 
continuing improvements to computer systems require that 
these systems be made ever faster. 

[0020] The overall speed of a computer system (also 
called the "throughput") may be crudely measured as the 
number of operations performed per unit of time. Concep- 
tually, the simplest of all possible improvements to system 
speed is to increase the clock speeds of the various compo- 
nents, and particularly the clock speed of the processor. E.g., 
if everything runs twice as fast but otherwise works in 
exactly the same manner, the system will perform a given 
task in half the time. Early computer processors, which were 
constructed from many discrete components, were suscep- 
tible to significant speed improvements by shrinking com- 
ponent size, reducing component number, and eventually, 
packaging the entire processor as an integrated circuit on a 
single chip. The reduced size made it possible to increase the 
clock speed of the processor, and accordingly increase 
system speed. 

[0021] Despite the enormous improvement in speed 
obtained from integrated circuitry, the demand for ever 
faster computer systems has continued. Hardware designers 
have been able to obtain still further improvements in speed 
by greater integration (i.e., increasing the number of circuits 
packed onto a single chip), by further reducing the size of the 
circuits, and by various other techniques. However, design- 
ers can see that physical size reductions can not continue 
indefinitely, and there are limits to their ability to continue 
to increase clock speeds of processors. Attention has there- 
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fore been directed to other approaches for further improve- 
ments in overall speed of the computer system. 

[0022] Without changing the clock speed, it is possible to 
improve system throughput by using multiple processors. 
The modest cost of individual processors packaged on 
integrated circuit chips has made this practical. While there 
are certainly potential benefits to using multiple processors, 
numerous additional architectural issues are introduced. In 
particular, multiple processors typically share the same main 
memory (although each processor may have it own cache). 
It is necessary to devise mechanisms that avoid memory 
access conflicts. For example, if two processors have the 
capability to concurrently read and update the same data, 
there must be mechanisms to assure that each processor has 
authority to access the data, and that the resulting data is not 
gibberish. Without delving into further architectural com- 
plications of multiple processor systems, it can still be 
observed that there are many reasons to improve the speed 
of the individual CPU, whether or not a system uses multiple 
CPUs or a single CPU. If the CPU clock speed is given, it 
is possible to further increase the speed of the individual 
CPU, i.e., the number of operations executed per second, by 
increasing the average number of operations executed per 
clock cycle. 

[0023] In order to boost CPU speed, it is common in high 
performance processor designs to employ instruction pipe- 
lining, as well as one or more levels of cache memory. 
Pipeline instruction execution allows subsequent instruc- 
tions to begin execution before previously issued instruc- 
tions have finished. Cache memories store frequently used 
and other data nearer the processor and allow instruction 
execution to continue, in most cases, without waiting the full 
access time of a main memory. 

[0024] Pipelines will stall under certain circumstances. An 
instruction that is dependent upon the results of a previously 
dispatched instruction that has not yet completed may cause 
the pipeline to stall. For instance, instructions dependent on 
a load/store instruction in which the necessary data is not in 
the cache, i.e., a cache miss, cannot be executed until the 
data becomes available in the cache. Maintaining the req- 
uisite data in the cache necessary for continued execution 
and to sustain a high hit ratio, i.e., the number of requests for 
data compared to the number of times the data was readily 
available in the cache, is not trivial especially for compu- 
tations involving large data structures. A cache miss can 
cause the pipelines to stall for several cycles, and the total 
amount of memory latency will be severe if the data is not 
available most of the time. Although memory devices used 
for main memory are becoming faster, the speed gap 
between such memory chips and high-end processors is 
becoming increasingly larger. Accordingly, a significant 
amount of execution time in current high-end processor 
designs is spent waiting for resolution of cache misses. 

[0025] It can be seen that the reduction of time the 
processor spends waiting for some event, such as re-filling 
a pipeline or retrieving data from memory, will increase the 
average number of operations per clock cycle. One archi- 
tectural innovation directed to this problem is called "mul- 
tithreading". This technique involves breaking the workload 
into multiple independently executable sequences of instruc- 
tions, called threads. At any instant in time, the CPU 
maintains the state of multiple threads. As a result, it is 



relatively simple and fast to switch threads. The term 
"multithreading" as defined in the computer architecture 
community is not the same as the software use of the term 
which means one task subdivided into multiple related 
threads. In the architecture definition, the threads may be 
independent. Therefore "hardware multithreading" is often 
used to distinguish the two uses of the term. As used herein, 
"multithreading" will refer to hardware multithreading. 

[0026] There are two basic forms of multithreading. In the 
more traditional form, sometimes called "fine-grained mul- 
tithreading", the processor executes N threads concurrently 
by interleaving execution on a cycle -by-cycle basis. This 
creates a gap between the execution of each instruction 
within a single thread, which removes the need for the 
processor to wait for certain short term latency events, such 
as re-filling an instruction pipeline. In the second form of 
multithreading, sometimes called "coarse-grained multi- 
threading", multiple instructions in a single thread are 
sequentially executed until the processor encounters some 
longer term latency event, such as a cache miss. 

[0027] Typically, multithreading involves replicating the 
processor registers for each thread in order to maintain the 
state of multiple threads. For instance, for a processor 
implementing the architecture sold under the trade name 
PowerPC™ to perform multithreading, the processor must 
maintain N states to run N threads. Accordingly, the follow- 
ing are replicated N times: general purpose registers, floating 
point registers, condition registers, floating point status and 
control register, count register, link register, exception reg- 
ister, save/restore registers, and special purpose registers. 
Additionally, the special buffers, such as a segment looka- 
side buffer, can be replicated or each entry can be tagged 
with the thread number and, if not, must be flushed on every 
thread switch. Also, some branch prediction mechanisms, 
e.g., the correlation register and the return stack, should also 
be replicated. However, larger hardware structures such as 
caches and execution units are typically not replicated. 

[0028] In a computer system using multiple CPUs (sym- 
metrical multi-processors, or SMPs), each processor sup- 
porting concurrent execution of multiple threads, the 
enforcement of memory access rules is a complex task. In 
many systems, each user program is granted a discrete 
portion of address space, to avoid conflicts with other 
programs and prevent unauthorized accesses. However, 
something must allocate addresses in the first place, and 
perform other necessary policing functions. Therefore, spe- 
cial supervisor programs exist which necessarily have access 
to the entire address space. It is assumed that these super- 
visor programs contain "trusted" code, which will not dis- 
rupt the operation of the system. In the case of a multipro- 
cessor system, it is possible that multiple supervisor 
programs will be running on multiple SMPs, each having 
extraordinary capability to access data addresses in memory. 
While this does not necessarily mean that data will be 
corrupted or compromised, avoidance of potential problems 
adds another layer of complexity to the supervisor code. 
This additional complexity can adversely affect system 
performance. To the extent hardware within each SMP can 
assist software supervisors, performance can be improved. 

[0029] In a large multiprocessor system, it may be desir- 
able to partition the system into one or more smaller logical 
SMPs, an approach known as logical partitioning. In addi- 
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tion, once a system is partitioned it may be desirable to 
dynamically re -partition the system based on changing 
requirements. It is possible to do this using only software. 
The additional complexity this adds to the software can 
adversely affect system performance. Logical partitioning of 
a system would be more effective if hardware support were 
provided to assist the software. Hardware support may be 
useful to help software isolate one logical partition from 
another. Said differently, hardware support may be used to 
prevent work being performed in one logical partition from 
corrupting work being performed in another. Hardware 
support would also be useful for dynamically re-partitioning 
the system in an efficient manner. This hardware support 
may be used to enforce the partitioning of system resources 
such as processors, real memory, internal registers, etc. 

SUMMARY OF THE INVENTION 

[0030] It is therefore an object of the present invention to 
provide an improved processor apparatus. 

[0031] Another object of this invention is to provide 
greater support, and in particular hardware support, for 
logical partitioning of a computer system. 

[0032] Another object of this invention is to provide an 
apparatus having greater hardware regulation of memory 
access in a processor. 

[0033] Another object of this invention is to increase the 
performance of a computer system having multiple proces- 
sors. 

[0034] Another object of the invention is to improve 
multithreaded processor hardware control for logical parti- 
tioning of a computer system. 

[0035] A processor provides hardware support for logical 
partitioning of a computer system. Logical partitions isolate 
the real address spaces of processes executing on different 
processors, specifically, supervisory processes. An ultra- 
privileged supervisor process, called a hypervisor, regulates 
the logical partitions. 

[0036] In the preferred embodiment, the processor con- 
tains multiple register sets for supporting the concurrent 
execution of multiple threads (i.e., hardware multithread- 
ing). Each thread is capable of independently being in either 
hypervisor, supervisor or problem (non-privileged) state. 

[0037] In the preferred embodiment, each processor gen- 
erates effective addresses from executable code, which are 
translated to real addresses corresponding to locations in 
physical main memory. Certain processes, particularly 
supervisory processes, may optionally run in a special 
(effective address equals real address^mode. In this mode, 
real addresses are constrained"wTthin alo^cal partition by 
effectively concatenating" certain high order bits from a 
special register (real merrtefy— ©ffset-~fegister) with lower 
order bits of the effective address. For clarity, the effective 
address in effective «real mode is referred to herein as a base 
real address, while the resultant address after partitiojujigfe 
referred to as a partitioned real address. Logical pa^ioninj^ 
of^the^'df^ss^spa ce amou nts to an enforced constrainTon 
"certainjii gh ordeV ad dress bits, so that within any given 
~"pafti(i61fthese addresTbitsare'trle same. Partitioning is thus 
distinguished from typical address translation, wherein a 
range of effective addresses is arbitrarily correlated a range 



of real addresses. The hardware which partitions a real 
address is actually a set of OR gates which perform a logical 
OR of the contents of the real memory ofiset register with an 
equal number of high order bits of effective address (base 
real address). By convention, the high order bits of effective 
address (i.e., in the base real address) which are used 
constrain the address to a logical partition should be 0. A 
separate range check mechanism concurrently verifies that 
these high order effective address bits are in fact 0, and 
generates a real address space check signal if they are not. 

[0038] In the preferred embodiment, the range check 
mechanism includes a 2-bit real memory limit register, and 
a set of logic gates. The limit register specifies the number 
of high order effective address (base real address) bits which 
must be zero (i.e., the size of the logical partition memory 
resource). The limit register value generates a mask, which 
is logically ANDed with selected bits of the effective 
address. The resulting bits are then logically ORed together 
to generate the real address space check signal. The use of 
this limit register mechanism supports logically partitioned 
memory spaces of different sizes. 

[0039] In the preferred embodiment, instruction addresses 
can be pre-fetched in anticipation of execution. In particular, 
dormant thread instructions may be pre-fetched while 
another thread is processing and executing instructions. The 
partitioning mechanism checks and controls instruction pre- 
fetching independently of the actively running thread. 

[0040] In the preferred embodiment, special operating 
system software running in hypervisor state can dynamically 
re -allocate resources to logical partitions. In particular, it can 
alter the contents of the real memory offset register and the 
real memory limit register which regulate the generation of 
partitioned real addresses; a logical partition identifier which 
identifies the logical partition to which a processor is 
assigned; and certain configuration information. 

[0041] In the preferred embodiment, the processor sup- 
ports different systems which use the hypervisor, supervisor 
and problem states differently. Thus, one mode of operation 
supports effective-real addressing mode in any state, but 
addresses are partitioned and checked as described above 
when operating in non-hypervisor state. A second mode of 
operation supports effective-real addressing mode in only 
the hypervisor state. 

[0042] The enforcement of logical partitioning by proces- 
sor hardware which intercepts a base real address and 
converts it to a partitioned real address removes the need for 
low-level operating system software to verify certain 
address constraints among multiple processors and threads, 
reducing the burden on operating system software and 
improving system performance. 

[0043] Other objects, features and characteristics of the 
present invention; methods, operation, and functions of the 
related elements of the structure; combination of parts; and 
the like will become apparent from the following detailed 
description of the preferred embodiments and accompany- 
ing drawings, all of which form a part of this specification, 
wherein like reference numerals designate corresponding 
parts in the various figures. 
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BRIEF DESCRIPTION OF THE DRAWINGS 

[0044] FIG. 1 is a high-level block diagram of the major 
hardware components of a computer system having multiple 
CPUs, according to the preferred embodiment of the inven- 
tion described herein. 

[0045] FIG. 2 is a high-level diagram of a central pro- 
cessing unit of a computer system according to the preferred 
embodiment. 

[0046] FIG. 3 illustrates the major components of an LI 
instruction cache, according to the preferred embodiment. 

[0047] FIG. 4 illustrates in greater detail real address 
partitioning logic, an effective to real address table and 
associated control structures for instruction addresses, 
according to the preferred embodiment, 

[0048] FIG. 5 illustrates real address partitioning logic for 
data addresses, according to the preferred embodiment. 

[0049] FIG. 6 illustrates the generation of instruction 
storage interrupts for enforcing logical partitioning, accord- 
ing to the preferred embodiment. 

[0050] FIG. 7 illustrates at a high level the generation of 
effective addresses for instructions, according to the pre- 
ferred embodiment. 

[0051] FIG. 8 is a logical illustration of address transla- 
tion, according to the preferred embodiment. 

[0052] FIG. 9 illustrates the operation of certain state and 
configuration registers, according to the preferred embodi- 
ment. 

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENTS 

[0053] The major hardware components of a multiproces- 
sor computer system 100 for utilizing the logical partitioning 
architecture according to the preferred embodiment of the 
present invention are shown in FIG. 1. CPUs 101 A, 101B, 
101C and 101D for processing instructions contains separate 
respective internal level one instruction caches 106 A, 106B, 
106C, 106D (Lll-cache) and level one data caches 107A, 
107B, 107C, 107D (LID-cache). Each Lllncache 106A, 
106B, 106C, 106 D stores instructions for execution by its 
CPU 101A, 101B, 101C, 101D. L I D-cache stores data 
(other than instructions) to be processed by a CPU. Each 
CPU 101A, 101B, 101C, 101D is coupled to a respective 
level two cache (L2 cache) 108A, 108B, 108C, 108D, which 
can be used to hold both instructions and data. Memory bus 
109 transfers data between L2 caches or CPU on the one 
hand and main memory 102 on the other. CPUs 101A, 101B, 
101 C, 101D, L2 cache 108A, 108B, 108C, 108D and main 
memory 102 also communicate via bus interface 105 with 
system bus 110. Various I/O processing units (IOPs) 111-115 
attach to system bus 110 and support communication with a 
variety of storage and I/O devices, such as direct access 
storage devices (DASD), tape drives, workstations, printers, 
and remote communication lines for communicating with 
remote devices or other computer systems. For simplicity, 
CPU, Lll-cache, LI D-cache, and L2 cache arc herein 
designated generically by reference numbers 101, 106, 107 
and 108, respectively. While various buses are shown in 
FIG. 1, it should be understood that these are intended to 



represent various communications paths at a conceptual 
level, and that the actual physical configuration of buses may 
vary. 

[0054] In the preferred embodiment, each CPU is capable 
of maintaining the state of two threads, and switches execu- 
tion between threads on certain latency events. I.e., CPU 
executes a single thread (the active thread) until some 
latency event is encountered which would force the CPU to 
wait, (a form of coarse-grained multithreading). Thread 
switching conditions and mechanisms are described in 
greater detail in U.S. Pat. No. 6,212,544, U.S. Pat. No. 
6,105,051, U.S. Pat. No. 6,076,157, and in commonly 
assigned copending U.S. patent application Ser. Nos. 
08/958,716, 08/957,002, all filed Oct. 23, 1997, and incor- 
porated herein by reference. However, it should be under- 
stood that the present invention could be practiced with a 
different number of thread states in each CPU, and that it 
would be possible to interleave execution of instructions 
from each thread on a cycle-by-cycle basis (fine-grained 
multithreading), or to switch threads on some different basis. 

[0055] FIG. 2 is a high level diagram of the major 
components of CPU 101, showing CPU 101 in greater detail 
than is depicted in FIG. 1, according to the preferred 
embodiment. In this embodiment, the components shown in 
FIG. 2 are packaged on a single semiconductor chip. CPU 
101 includes instruction unit portion 201, execution unit 
portion 211 and 212, and storage control portion 221. In 
general, instruction unit 201 obtains instructions from Lll- 
cache 106, decodes instructions to determine operations to 
perform, and resolves branch conditions to control program 
flow. Execution unit 211 performs arithmetic and logical 
operations on data in registers, and loads or stores data. 
Storage control unit 221 accesses data in the LI data cache 
or interfaces with memory external to the CPU where 
instructions or data must be fetched or stored. 

[0056] Instruction unit 201 comprises branch unit 202, 
buffers 203, 204, 205, and decode/dispatch unit 206. Instruc- 
tions from Lll-cache 106 are loaded into one of the three 
buffers from Lll -Cache instruction bus 232. Sequential 
buffer 203 stores 16 instructions in the current execution 
sequence. Branch buffer 205 stores 8 instructions from a 
branch destination; these are speculatively loaded into buffer 
205 before branch evaluation, in the event the branch is 
taken. Thread switch buffer 204 stores 8 instructions for the 
inactive thread; in the event a thread switch is required from 
the currently active to the inactive thread, these instructions 
will be immediately available. Decode/dispatch unit 206 
receives the current instruction to be executed from one of 
the buffers, and decodes the instruction to determine the 
operation(s) to be performed or branch conditions. Branch 
unit 202 controls the program flow by evaluating branch 
conditions, and refills buffers from Lll-cache 106 by send- 
ing an effective address of a desired instruction on Lll- 
Cache address bus 231. 

[0057] Execution unit 211 comprises S-pipe 213, M-pipe 
214, R-pipe 215 and a bank of general purpose registers 217. 
Registers 217 are divided into two sets, one for each thread. 
R-pipe is a pipelined arithmetic unit for performing a subset 
of integer arithmetic and logic functions for simple integers. 
M-pipe 214 is a pipelined arithmetic unit for performing a 
larger set of arithmetic and logic functions. S-pipe 213 is a 
pipelined unit for performing load and store operations. 
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Floating point unit 212 and associated floating point regis- 
ters 216 are used for certain complex floating point opera- 
tions which typically require multiple cycles. Like general 
purpose registers 217, floating point registers 216 are 
divided into two sets, one for each thread. 

[0058] Storage control unit 221 comprises memory man- 
agement unit 222, L2 cache directory 223, L2 cache inter- 
face 224, LI data cache 107, and memory bus interface 225. 
LI D-cache is an on-chip cache used for data (as opposed to 
instructions). L2 cache directory 223 is a directory of the 
contents of L2 cache 108. L2 cache interface 224 handles the 
transfer of data directly to and from L2 cache 108. Memory 
bus interface 225 handles the transfer of data across memory 
bus 109, which may be to main memory 102 or to L2 cache 
units associated with other CPUs. Memory management unit 
222 is responsible for routing data accesses to the various 
units. E.g., when S-pipe 213 processes a load command, 
requiring data to be loaded to a register, memory manage- 
ment unit may fetch the data from LID -cache 107, L2 cache 
108, or main memory 102. Memory management unit 222 
determines where to obtain the data. LI D-cache 107 is 
directly accessible, as is the L2 cache directory 223, 
enabling unit 222 to determine whether the data is in either 
LI D-cache 107 or L2 cache 108. If the data is in neither 
on-chip LI D-cache nor L2 cache 108, it is fetched from 
memory bus 109 using memory interface 225. 

[0059] While various CPU components have been 
described and shown at a high level, it should be understood 
that the CPU of the preferred embodiment contains many 
other components not shown, which are not essential to an 
understanding of the present invention. For example, various 
additional special purpose registers will be required in a 
typical design, some of which must be replicated for each 
thread. It should also be understood that the number, type 
and arrangement of components within CPU 101 could be 
varied. For example, the number and configuration of buff- 
ers and caches may vary; the number and function of 
execution unit pipelines may vary; registers may be config- 
ured in different arrays and sets; dedicated floating point 
processing hardware may or may not be present; etc. 

[0060] CPU 101 of the preferred embodiment supports 
multiple levels of address translation, as logically illustrated 
in FIG. 8. The three basic addressing constructs are effective 
address 801, virtual address 802, and real address 803. An 
"effective address" refers to the address from the point of 
view of the executable code, i.e., it is an instruction address 
generated by instruction unit 201, or a data address gener- 
ated by execution unit 211. An effective address may be 
produced in any of various ways known in the art, e.g., as a 
concatenation of some high-order address bits in a special- 
purpose register (which changes infrequently, e.g., when 
execution of a new task is initiated) and lower order address 
bits from an instruction; as a computed offset from an 
address in a general purpose register; as an offset from the 
currently executing instruction; etc., as illustrated in greater 
detail in FIG. 7, explained below. In this embodiment, an 
effective address comprises 64 bits, numbered 0 to 63 (0 
being the highest order bit). A "virtual address" is an 
operating system construct,: used to isolate the address 
spaces of different users. I.e., if each user may reference the 
full range of effective addresses, then the effective address 
spaces of different users must be mapped into a larger virtual 
address space to avoid conflicts. The virtual address is not a 



physical entity in the sense that it is stored in registers; it is 
a logical construction, resulting from a concatenation of a 
52-bit virtual segment ID 814 and the low-order 28 bits of 
the effective address, a total of 80 bits. A "real address" 
refers to a physical location in memory 102 where the 
instruction or data is stored. The real address comprises 40 
bits numbered 24 to 63 (24 being the highest order bit). 

[0061] As shown in FIG. 8, an effective address 801 
comprises 36-bit effective segment ID 811, 16-bit page 
number 812, and 12-bit byte index 813, the effective seg- 
ment ID occupying the highest order bit positions. A virtual 
address 802 is constructed from an effective address by 
mapping the 36-bit effective segment ID 811 to a 52 -bit 
virtual segment ID 814, and concatenating the resultant 
virtual segment ID 814 with page number 812 and byte 
index 813. A real address 803 is derived from the virtual 
address by mapping the virtual segment ID 814 and page 
number 812 to a 28-bit real page number 815, and concat- 
enating the real page number with byte index 813. Because 
a page of main memory contains 4K (i.e., 2 12 ) bytes, the byte 
index 813 (lowest order 12 address bits) specifies an address 
within a page, and is the same whether the address is 
effective, virtual or real. The higher order bits specify a page, 
and are therefore sometimes referred to as an "effective page 
number" or "real page number", as the case may be. 

[0062] Computer system 100 contains an address transla- 
tion mechanism for translating effective addresses generated 
by CPU 101 to real addresses in memory 102. This address 
translation mechanism includes a segment table mechanism 
821 for mapping effective segment ID 811 to virtual segment 
ID 814, and a page table mechanism 822 for mapping virtual 
segment ID 814 and page number 812 to real page number 
815. While these mechanisms are shown in FIG. 8 as single 
entities for illustrative purposes, they in fact comprise mul- 
tiple tables or register at different levels. I.e., a complete 
page table and a complete segment table reside in main 
memory 102, while various smaller cached portions of the 
data in these tables is contained in CPU 101 itself or the L2 
cache. There are additional translation mechanisms (not 
shown) which will in limited circumstances translate 
directly from an effective to a real address. 

[0063] While CPU 101 supports address translation as 
illustrated in FIG. 8, it also supports more simple address- 
ing. Specifically, one of the operating modes is a "tags 
active" mode, in which effective addresses are the same as 
virtual addresses (i.e., an effective segment ID 811 maps 
directly to virtual segment ID 814 without lookup, so that 
the high-order 16 bits of virtual segment ID are always 0), 
CPU 101 may also operate in an effective^real addressing 
mode. 

[0064] Effective -real mode (E-R) is a special addressing 
mode, typically reserved for certain low level operating 
system functions which operate more efficiently if always 
stored at the same real address locations. These operating 
system functions may need to access reserved areas of 
memory, and therefore typically execute in a special privi- 
leged state (as opposed to most user executable code, which 
executes in a non-privileged state called a "problem state"). 
These operating system functions are created and tested by 
a process assumed to be trusted, in the sense that the 
resulting code will not cause unauthorized interference with 
machine processes. When executing in E-R mode and 
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without logical partitioning, the lower order 40 bits of 
effective address (i.e., EA^ 63 ) generated by instruction unit 
201 (in the case of instructions) or execution unit 211 (in the 
case of data) is the same as the real address (RA^ 63 ); the 
high order effective address bits are assumed to be 0. When 
operating in E=R mode, addresses are not translated, i.e., the 
page table mechanism and segment table mechanism, 
described above, along with any associated caches, are not 
used. This has the effect of mapping all E=R mode processes 
to the same real memory, even when executing on different 
processors. E=R mode addressing is active when either (a) 
an applicable address translate bit in one of the machine state 
registers is set off, or (b) under certain circumstances, when 
the effective address lies within a special reserved range of 
addresses. Appropriate hardware logic (not shown) detects 
these conditions and generates an E=R control signal for use 
by addressing logic. 

[0065] In the preferred embodiment, computer system 100 
can be logically partitioned. Logical partitioning means that 
the system is logically divided into multiple subsets called 
logical partitions, and some of the system resources are 
assigned to particular logical partitions, while other 
resources are shared among partitions. In the preferred 
embodiment, processors and real memory are assigned to 
logical partitions in a partitioned system, while buses, I/O 
controllers, and I/O devices are shared, it being understood 
that it would be possible to assign different types and 
mixtures of devices to partitions. In a logically partitioned 
system, each processor of the multiprocessor system is 
assigned to a partition, along with a subset of the real 
memory address space. With limited exceptions (explained 
below), tasks executing on a processor can only access real 
memory within that processor's subset of the real memory 
address space. This has the effect of isolating tasks executing 
on different processors in different logical partitions. From 
the standpoint of CPU and memory, the logically partitioned 
multiprocessor computer system behaves very much like 
multiple separate computer systems. This avoids some of the 
contention and other overhead issues associated with prior 
art multiprocessor systems. At the same time, the different 
logical partitions share hardware resources such as disk 
storage and I/O, as well as certain low level software 
resources. Thus, many of the advantages of a multiprocessor 
system over multiple discrete single processor systems are 
maintained. Furthermore, it is possible for multiple proces- 
sors to share a single logical partition. For example, a 
computer system containing 16 processors could be config- 
ured in four logical partitions, each containing four proces- 
sors, and resembling in certain characteristics the perfor- 
mance of four 4-way multiprocessor systems as opposed to 
a single 16- way multiprocessor system. 

[0066] Since user executable (non-privileged) code is 
typically translated as described above from an effective 
address to a real address (with or without the intermediate 
virtual address), this same basic mechanism can be used to 
support logical partitioning. The operating system will 
assign a block of user- accessible address space to a block of 
real memory address space lying within the logical partition 
of the processor executing the user code. Subsequent refer- 
ences to an effective address within this block will be 
translated using the translation mechanisms to the corre- 
sponding block of real memory address space. Thus, user 
executable code will reference something within the logical 



partition of the processor, without affecting memory outside 
the processor's logical partition. 

[0067] However, the translation mechanism can not 
enforce logical partitioning of address references in E=R 
mode. Generally, this is privileged code, created using a 
trusted process. Even though the code is created using a 
trusted process, there are performance reasons to isolate 
such code executing on different processors to different 
logical partitions. At the same time, there is still a need for 
some operating system functions to have access to the entire 
real memory. 

[0068] To support logical partitioning, two privileged 
execution states are defined, in addition to the non-privi- 
leged "problem state". The privileged execution states are 
called "supervisor state" and "hypervisor state". Most privi- 
leged functions execute in the supervisor state, and are 
confined to the logical partition of the processor upon which 
they are executing. Supervisor state code may be untrans- 
lated, in which case the high-order effective address bits are 
directly manipulated by hardware to confine address refer- 
ences to the logical partition of the executing processor. In 
this manner, duplicates of these functions can concurrently 
execute on different processors in different logical partitions, 
without concern for the effect on other logical partitions. 
Only a select few functions, such as those which support 
logical partitioning itself, execute in the ultra-privileged 
hypervisor state, and have access to the full real address 
space of computer system 100. Each executing thread has its 
own privilege state (either hypervisor, supervisor, or prob- 
lem), which is independent of the privilege state associated 
with any other thread. 

[0069] Processor state and configuration information is 
maintained in a set of special-purpose registers. FIG. 9 
illustrates some of these registers and associated control 
structures. The key register is Active-Thread Machine State 
Register (MSR) 901, which maintains certain state informa- 
tion for the currently active thread. Dormant-Thread 
Machine State Register (MSRDorm) 902 maintains the same 
type of information for the currently dormant thread. Each 
register 901, 902 contains the following respective bits, 
among others: 

[0070] DR bit, which indicates the corresponding 
thread's data addresses should be translated; 

[0071] IR bit, which indicates the corresponding 
thread's instruction addresses should be translated; 

[0072] Pr bit, which indicates whether the corre- 
sponding thread is in problem state; 

[0073] TAbit, which indicates "tags active" mode. 

[0074] HV bit, which indicates the corresponding 
thread is in hypervisor state; 

[0075] FIG. 9 illustrates respective data relocate (DR) 
signal lines 921, 931; instruction relocate (IR) signal lines 
922, 932; problem state signal lines 923, 933; tags active 
signal lines 924, 934; and hypervisor state signal lines 925, 
935. 

[0076] A machine state register is not permanently asso- 
ciated with a thread; rather, there is one physical register 901 
which always contains the information for the active thread, 
and another which contains the dormant thread's informa- 
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tion. For this reason, an Active Thread Identifier bit 961 is 
needed to identify which is the active or dormant thread. 
ActThreadID bit 961 is kept in a separate special register. 
Upon a thread switch, the contents of registers 901 and 902 
are swapped, and ActThreadID bit 961 is changed. Swap- 
ping register contents simplifies downstream control mecha- 
nisms, since in most cases only the contents of the active 
thread MSR 901 are relevant. 

[0077] As shown in FIG. 9, input to each machine state 
register 901, 902 is controlled by a respective multiplexer 
903, 904, which receives inputs from various sources. The 
inputs to multiplexer 903 illustrate the various ways in 
which MSR 901 can be altered. Input path 941 represents a 
move to MSR (mtMSR) instruction, i.e., MSR 901 can be 
altered by executing a special mtMSR instruction while in a 
privileged state, which causes data to be loaded directly 
from a general purpose register into the MSR. Input path 942 
represents an interrupt state, i.e., upon occurrence of an 
interrupt condition, the MSR is automatically loaded with a 
predefined state associated with the interrupt. Input paths 
943 and 944 represent a System Call and a System Call 
Vectored, respectively. These are special processor instruc- 
tions, typically made while in the problem state in order to 
invoke a privileged state. Both cause a predefined state to be 
loaded into MSR and a jump to a predefined location. 
System Call Vectored does not affect as many bits in the 
MSR as does System Call, i.e., System Call Vectored causes 
only a few bits to change, most of the bits being simply 
copied from their current state. Return from System Call 
Vectored (rfecv) path 945 represents reloading the MSR with 
its previous state upon return from a System Call Vectored; 
these values are stored in a special register (not shown). 
Return from interrupt/system call path 946 is conceptually 
similar to rfscv 945, and represents reloading the MSR with 
its previous state upon return from an interrupt or a System 
Call. The previous state of the MSR is saved in SRR1 
registers 905, 906, which are special purpose registers for 
holding a saved state. One SRR1 register is associated with 
each thread, and the state of MSR 901 is saved to the register 
associated with the currently active thread as identified by 
ActThreadID 961. Upon return from an interrupt or System 
Call, multiplexer 907 selects the appropriate register 905 or 
906 for restoring the previous MSR state. Input path 947 
represents the contents of MSRDorm 902, which is loaded 
into MSR 901 upon a thread switch. Input path 948 repre- 
sents the current contents of MSR 901; because some of the 
events which cause changes to MSR 901 do not affect all 
bits, this path represents a copying of non-affected bits back 
into MSR 901. 

[0078] MSRDorm 902 is altered in similar fashion, 
although fewer paths are shown in FIG. 9 because an 
interrupt, System Call, or System Call Vectored can apply 
only to the currently active thread. Like MSR 901, MSR- 
Dorm 902 can be altered by a special move to MSRDorm 
instruction, as represented by path 951. MSRDorm will 
receive the contents of MSR 901 upon a thread switch, as 
represented by path 952. Finally, path 953 represents copy- 
ing bits not affected by a change back into MSRDorm 902. 

[0079] Also shown in FIG. 9 is a set of configuration 
registers 910. Unlike MSR 901 and MSRDorm 902, these 
registers contain configuration information which is 
intended to change rarely, if at all. I.e., information in 
configuration registers 910 might be set upon initial instal- 



lation of a system and might be altered upon major recon- 
figuration, such as the addition of processors to the system, 
or the system being re-partitioned. These registers can be 
altered only in hypervisor mode, i.e., are not intended to be 
written to from user executable code. Typically, information 
is loaded into configuration registers by a special-purpose 
service processor during system initialization. Among the 
information held in configuration registers 910 is a Logical 
Partitioning Environment Selector (LPES) bit 911 This bit is 
used to specify one of two operating system environments, 
designated "RS" and "AS". In the "RS" environment, non- 
hypervisor address references in E-R mode must be forced 
into the real memory subset of the processor's logical 
partition; in the "AS" environment, non-hypervisor address 
references in ER mode are not allowed. Configuration 
registers 910 also contain a 12-bit real memory offset field 
912, also referred to as a real memory offset register 
(RMOR), although it is physically part of the larger con- 
figuration register set 910. Configuration registers 910 also 
contain a 2-bit real memory limit field 913, also referred to 
as a real memory limit register (RMLR). Configuration 
registers 910 further contain a Logical Partition ID (LPID) 
field 914, which is an identifier assigned to the logical 
partition to which the processor belongs. 

[0080] The Pr bits 923, 933 and HV bits 925, 935 define 
the privilege state. If the HV bit is set, the corresponding 
thread is in the hypervisor state. If the HV bit is not set and 
the Pr bit is set, the corresponding thread is in the problem 
state. If nether bit is set, the corresponding thread is in the 
supervisor state. The HV bit can not be altered by a mtMSR 
instruction, for this would allow a thread in supervisor state 
to place itself in hypervisor state. The HV bit can only be set 
automatically by the hardware under certain predefined 
conditions, specifically certain interrupts (depending on the 
setting of LPES bit 911) or certain System Calls, any of 
which cause instructions to branch to one of a set of 
predefined locations. Naturally, these predefined locations 
must contain trusted code suitable for execution in hyper- 
visor state. All predefined locations associated with Hyper- 
visor state are contained within a single real address subset 
at the low address range. This subset is reserved and can not 
be assigned to any processor of multiprocessor system 100. 
The conditions for setting the HV bit can be summarized as 
follows: 



MSR(HV)<?={^LPES AND ( Any Jntermpt OR SysteinCal^)) OR 
{LPES AND CMachine_Check_Intemipt OR 
Systemjtesetjnterrupt OR SysteiruCatl^)) 



[0081] Where System^CalL^ indicates a System Call (not 
including a System Caff Vectored) in which bit 26 is set. 
Upon return from the interrupt or system call, the previous 
thread state is reloaded in the MSR register from one of 
SRR1 registers 905 or 906. This previous state includes the 
previous value of HV bit 925, and the HV bit is thus reset 
to its previous value 

[0082] In a logically partitioned multiprocessor system, all 
address references in either problem or supervisor state 
should be confined to the logical partition associated with 
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the processor which generated the address. Only in the 
hypervisor state should it be possible to reference an address 
outside this range. 

[0083] FIG. 7 illustrates at a conceptual level the genera- 
tion of effective addresses of instructions in instruction unit 
201. Instruction unit 201 is capable of generating an address 
in any of a variety of ways. 10 Instruction Address Register 
701 represents generation from an address in the instruction 
address register, i.e., an immediate address in the current 
thread. The most common way to generate an address is by 
incrementing this address, represented as path 711. In some 
cases, the address in I0IAR 701 is used directly (e.g., when 
it was loaded into I0IAR 701), represented as path 710. 
Branch Address (relative) block 702 represents a relative 
branching instruction, in which an offset may be contained 
in the instruction or in a register. Because this is a branch 
relative instruction, the offset is added to low order address 
bits from I0IAR, and the high order bits from I0IAR may be 
incremented, decremented, or passed through unchanged. 
These bits are then combined, represented as path 712. 
Bpipe-Base block 703 represents generation of an absolute 
branch through various means, usually using a hardware 
branch pipeline, such as branching to a value from a general 
purpose or a special register, a value derived as a combina- 
tion of bits in the instruction and register bits, indirect 
addressing, etc. SCV block 704 represents an address result- 
ing from a system call or interrupt condition, which branch 
to predefined locations. Fetch Instruction Address Register 
block 705 represents address generation for speculative 
conditions. As shown, this might involve generation of the 
next instruction address for a dormant thread (block 706) or 
SRRO registers which hold return from interrupted 
addresses for the current thread (block 707) and dormant 
thread (block 708), these values being swapped on a thread 
switch. 

[0084] Ideally, instruction unit 201 provides a constant 
stream of instructions for decoding in decoder 206, and 
execution by execution unit 211. LI 1 -cache 106 must 
respond to an access request with minimal delay. Where a 
requested instruction is actually in LI 1 -cache, it must be 
possible to respond and fill the appropriate buffer without 
requiring decoder/dispatcher 206 to wait. Where LI 1 -cache 
can not respond (i.e., the requested instruction is not in 
LI 1 -cache), a longer path via cache fill bus 233 through 
memory management unit 222 must be taken. In this case, 
the instruction may be obtained from L2 cache 108, from 
main memory 102, or potentially from disk or other storage. 
It is also possible that the instruction will be obtained from 
L2 cache of another processor. In all of these cases, the delay 
required to fetch the instruction from a remote location may 
cause instruction unit 201 to switch threads. I.e., the active 
thread becomes inactive, the previously inactive thread 
becomes active, and the instruction unit 201 begins process- 
ing instructions of the previously inactive thread held in 
thread switch buffer 204. 

[0085] FIG. 3 illustrates the major components of Lll- 
cache 106 in greater detail than shown in FIG. 1 or 2, 
according to the preferred embodiment. LI 1 -cache 106 
includes effective-to- real address table (ERAT) 301, 1-cache 
directory array 302, and I-cache instruction array 303. 
I-cache instruction array 303 stores the actual instructions 
which are supplied to instruction unit 201 for execution. 
I-cache directory array 302 contains a collection of real page 



numbers, validity bits, and other information, used to man- 
age instruction array 303, and in particular to determine 
whether a desired instruction is in fact in the instruction 
array 303. ERAT 301 contains pairs of effective page 
numbers and real page numbers, and is used for associating 
effective with real addresses. 

[0086] When instruction unit 201 requests an instruction 
from I-cache 106, providing an effective address of the 
requested instruction, I-cache must rapidly determine 
whether the requested instruction is in fact in the cache, 
return the instruction if it is, and initiate action to obtain the 
instruction from elsewhere (e.g., L2 cache, main memory) if 
it is not. In the normal case where the instruction is in fact 
in LI 1-cache 106, the following actions occur concurrently 
within the I-cache, as illustrated in FIG. 3: 

[0087] (a) The effective address from instruction unit 
201 is used to access an entry in ERAT 301 to derive 
an effective page number and associated real page 
number. 

[0088] (b) The effective address from instruction unit 
201 is used to access an entry in directory array 302 
to derive a pair of real page numbers. 

[0089] (c) The effective address from instruction unit 
201 is used to access an entry in instruction array 303 
to derive a pair of cache lines containing instruc- 
tions. 

[0090] In each case above, the input to any one of ERAT 
301, directory array 302, or instruction array 303, is not 
dependent on the output of any other one of these compo- 
nents, so that none of the above actions need await comple- 
tion of any other before beginning. The output of the ERAT 
301, directory array 302, and instruction array 303 are then 
processed as follows: 

[0091] (a) The effective page number from ERAT 
301 is compared with the same address bits of the 
effective address from instruction unit 201 in com- 
parator 304; if they match, there has been an ERAT 
"hit". (But where addressing in E=>R mode, the 
ERAT is always deemed "hit" regardless of the 
comparison, as explained below.) 

[0092] (b) The real page number from ERAT 301 is 
compared with each of the real page numbers from 
directory array 302 in comparators 305 and 306; if 
either of these match, and if there has been an ERAT 
hit, then there is an I-cache "hit", i.e., the requested 
instruction is in fact in I-cache 106, and specifically, 
in instruction array 303. 

[0093] (c) The output of the comparison of real page 
numbers from ERAT 301 and directory array 302 is 
used to select (using selection multiplexer 307) 
which of the pair of cache lines from instruction 
array 303 contains the desired instruction. 

[0094] Performing these actions concurrently minimizes 
delay where the desired instruction is actually in the I-cache. 
Whether or not the desired instruction is in the I-cache, some 
data will be presented on the I-cache output to instruction 
unit 201. A separate I-cache hit signal will indicate to 
instruction unit 201 that the output data is in fact the desired 
instruction; where the I-cache hit signal absent, instruction 
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unit 201 will ignore the output data. The actions taken by 
I-cache 106 in the event of a cache miss are discussed later 
herein. 

[0095] FIG. 4 shows in greater detail ERAT 301, and 
associated control structures. ERAT 301 is an 82-bitxl28 
array (i.e, contains 128 entries, each having 82 bits). Each 
ERAT entry contains a portion (bits 0-46) of an effective 
address, a portion (bits 24-51) of a real address, and several 
additional bits described below. ERAT 301 may be thought 
of as a small cache directly mapping a subset of effective 
addresses to their respective real addresses, thus avoiding 
the delays inherent in the address translation mechanism 
depicted in FIG. 8, described above. Because ERAT 301 is 
a cache of the larger mapping structures, mapped-to real 
addresses within the ERAT are confined to the logical 
partition of the processor which generated the effective 
address if partition integrity is maintained within the larger 
mapping structures, which is the responsibility of the oper- 
ating system. 

[0096] ERAT 301 is accessed by constructing a hash 
function of bits 45-51 of the effective address (EA), along 
with two control lines: multi-thread control line (MT), 
which indicates whether multithreading is active (in the 
CPU design of the preferred embodiment, it is possible to 
turn multithreading off); and ActThreadlD line 961. The 
hash function (HASH) is as follows: 

HASHm = (£A 4 s AND-. MT) OR (ActThreadlD AND Af T)||EA«I1 
EA 3i XOR EA^^EA^XOR EA 4S \\EA A9 ^ 



[0097] As can be seen, this is a 7-bit function, which is 
sufficient to specify any one of the 128 entries in the ERAT. 
Select logic 401 selects the appropriate ERAT entry in 
accordance with the above hash function. 

[0098] Comparator 304 compares bits 0-46 of the effective 
address generated by instruction unit 201 with the effective 
address portion of the selected ERAT entry. Because bits 
47-51 of the effective address from instruction unit 201 were 
used to construct the hash function, it can be shown that a 
match of bits 0-46 is sufficient to guarantee a match of the 
full effective page number portion of the address, i.e. bits 
0-51. A match of these two address portions means that the 
real page number (RA 24 5I ) in the ERAT entry is in fact the 
real page number corresponding to the effective address 
page number (EAq 51 ) specified by instruction unit 201. For 
this reason, the effective address portion stored in an ERAT 
entry is sometimes loosely referred to as an effective page 
number, although in the preferred embodiment it contains 
only bits 0-46 of the effective page number. 

[0099] Because the ERAT effectively by-passes the 
address translation mechanisms described above and 
depicted in FIG. 8, the ERAT duplicates some of the access 
control information contained in the normal address trans- 
lation mechanism. I.e., a translation of effective address to 
real address will normally verify access rights through 
additional information contained in segment table 821, page 
table 822, or elsewhere. ERAT 301 caches a subset of this 
information to avoid the need to refer to these address 
translation mechanisms. Further information about the 



operation of the ERAT can be found in U.S. Pat. No. 
6,021,481, entitled Effective-To-Real Address Cache Man- 
aging Apparatus and Method, herein incorporated by refer- 
ence. 

[0100] Each ERAT entry contains several parity, protec- 
tion, and access control bits. In particular, each ERAT entry 
includes a cache inhibit bit, a problem state bit, and an 
access control bit. Additionally, separate array 403 (1 bitx 
128) contains a single valid bit associated with each respec- 
tive ERAT entry. Finally, a pair of tag mode bits is stored in 
separate register 404. The valid bit from array 403 records 
whether the corresponding ERAT entry is valid; a variety of 
conditions might cause processor logic (not shown) to reset 
the valid bit, causing a subsequent access to the correspond- 
ing ERAT entry to reload the entry. The cache inhibit bit is 
used to inhibit writing the requested instruction to I-cache 
instruction array 303. I.e., although a range of addresses may 
contain an entry in ERAT, it may be desirable to avoid 
caching instructions in this address range in the I-cache. In 
this case, every request for an instruction in this address 
range will cause the line fill sequence logic (described 
below) to obtain the requested instruction, but the instruc- 
tion will not be written to array 303 (nor will directory array 
302 be updated). The problem state bit records the "problem 
state" of the active thread (from MSR(Pr) bit 923) at the time 
the ERAT entry is loaded. A thread executing in privileged 
state generally has greater access rights than one in problem 
state. If an ERAT entry were loaded during one state, and the 
problem state subsequently changed, there is a risk that the 
currently executing thread should not have access to 
addresses in the range of the ERAT entry, and this informa- 
tion must accordingly be verified when the ERAT is 
accessed. The access control bit also records access infor- 
mation at the time the ERAT entry was loaded, and is 
checked at the time of access. Tag mode bits 404 record the 
tag mode of the processor (tags active or tags inactive) when 
the ERAT was loaded; there is one tag mode bit associated 
with each half (64 entries) of the ERAT, which is selected 
using the 0 bit of the ERAT HASH function. Since tag mode 
affects how effective addresses are interpreted, a change to 
tag mode means that the real page numbers in the ERAT 
entry can not be considered reliable. It is expected that the 
tag mode will change infrequently, if ever. Therefore, if a 
change is detected, all entries in the corresponding half of 
the ERAT are marked invalid, and are eventually reloaded. 

[0101] When CPU 101 is executing in effective=real 
mode, the ERAT is effectively bypassed. In a non-logically 
partitioned system, E=R would imply that the lower order 40 
bits of effective address (i.e., EA^ 63 ) generated by instruc- 
tion unit 201 are the same as the real address (RA^ 63 ), and 
hence any real address is potentially accessible. Logical 
partitioning requires that the effective addresses (base real 
address) be converted to a partitioned real address, i.e. one 
that is confined to some subset of the real address space. 
Bitwise OR logic 422 performs a logical OR of each bit in 
real memory offset register (RMOR) from configuration 
register set 910, with a corresponding bit of effective address 
in the range of bits 24 to 35, i.e., 12 bits in all are ORed. The 
bits in the RMOR correspond to the real address space of a 
logical partition. When using E=R mode and not in hyper- 
visor state, the high order effective address bits in the range 
of those which enforce logical partitioning should all be 
zeroes. OR logic 422 is used instead of simple concatenation 
in order to support logically partitioned real address space 
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subsets of different sizes. In the preferred embodiment, real 
address space subset sizes of 64 GB (236 bytes), 4 GB (232 
bytes) and 256 MB (I 28 bytes) are supported. For example, 
when a partition size of 64 GB is being used, the 4 high order 
bits in RMOR will identify a real address space subset 
allocated to a logical partition, the 8 low order bits of RMOR 
must be set to 0, EA 24 27 must be 0, and EA 28 63 will specify 
a real address within the subset of the logical partition. 
Similarly, where a real address space subset size of 256 MB 
is being used, all 12 bits of the RMOR will identify a real 
address space subset, EA^ 35 must be 0, and EA 36 63 will 
specify a real address within the logical partition. In hyper- 
visor state, a processor has access to the entire real memory 
address space and system resources, and the RMOR is 
therefore by-passed. Additionally, the RMOR is by-passed 
when LPES bit 911 is 0, indicating that computer system 100 
is configured in "AS" environment. As shown in FIG. 4, HV 
bit 620 and LPES bit 911 control multiplexer 421, which 
selects effective address bits 24-35 (EA^ 35 ) if either of 
these conditions is present, and otherwise selects the output 
of OR logic 422. 

[0102] As shown in FIG. 4, when control line E=R is 
active, selection multiplexer 402 selects RA^ 51 from the 
selected ERAT entry as the real page number (RPN) output 
when E=R is false, and multiplexer 402 selects the output of 
multiplexer 421, concatenated with EA 3(5 51 when E=R is 
true. Additionally, where E=R is true, the ERAT is deemed 
to be hit regardless of the comparison result in comparator 
304. 

[0103] ERAT logic 405 generates several control signals 
which control the use of the RPN output of selection 
multiplexer 402 and ERAT maintenance, based on the output 
of selector 304, the effective=real mode, the various bits 
described above, and certain bits in the CPU's Machine 
State Register (or MSRDorm, as the case may be). In 
particular, logic 405 generates ERAT Hit signal 410, Pro- 
tection Exception signal 411, ERAT Miss signal 412, and 
Cache Inhibit signal 413. 

[0104] ERAT Hit signal 410 signifies that the RPN output 
of selection multiplexer 402 may be used as the true real 
page number corresponding to the requested effective 
address. This signal is active when effective«=real (by-pass- 
ing the ERAT); or when comparator 304 detects a match and 
there is no protection exception and certain conditions which 
force an ERAT miss are not present. This can be expressed 
logically as follows: 

ERAT_Hit= (E = R) OR (Match_304 AND -. Prolection_Exc AND 
-i Farcc_Miss) 



[0105] Where Match_304 is the signal from comparator 
304 indicating that EAo A6 from instruction unit 201 matches 
EAq 46 in the ERAT entry. 

[0106] Protection Exception signal 411 signifies that, 
while the ERAT entry contains valid data, the currently 
executing process is not allowed to access it. ERAT Miss 
signal 412 indicates that the requested ERAT entry does not 
contain the desired real page number, or that the entry can 
not be considered reliable; in either case, the ERAT entry 



must be reloaded. Cache inhibit signal 413 prevents the 
requested instruction from being cached in instruction array 
303. These signals are logically derived as follows: 

Force_Miss = -. Valid OR (MSR(Pr) * ERAT{Pr)) OR(MSR(TA) 
* Tag_404) 

Protection_Exc = ->(£ = #) AND -. Force_Miss AND Match_304 AND 
ERA1\AC)AND(MSR{Us) OR -> MSR(TA)) 
ERATJvliss = -(£=/?) AND(-> Match_304 OR Force Jvliss) 
Cache_Inhibit=^ (E = R) AND ERAT[Cl) 



[0107] Where: 

[0108] Valid is the value of valid bit from array 403; 

[0109] ERAT(Pr) is the problem state bit from the 
ERAT entry; 

[0110] ERAT(AC) is the access control bit from the 
ERAT entry; 

[0111] ERAT(CI) is the cache inhibit bit from the 
ERAT entry; 

[0112] MSR(TA) is the tags active bit from the 
Machine State Register; 

[0113] MSR(Us) is the User state bit from the 
Machine State Register; and 

[0114] Tag_404 is the selected tag bit from register 
404. 

[0115] I -cache directory array 302 and contains 512 
entries, each having a pair of real page numbers, validity 
bits, parity bits, and a most-recently-used bit. An entry in 
array 302 is selected using effective address bits 48-56 
(EA^ ?# ), which are used as a sparse hash function. Because 
there is no guarantee that either of the real page numbers 
contained in an entry in array 302 correspond to the full 
effective address page number of the desired instruction, 
both selected real page numbers are simultaneously com- 
pared with the real page number output 411 of ERAT 301, 
using comparators 305 and 306. The output of these and 
certain other logic determines which real page number, if 
any can be used. EA^ 38 simultaneously selects an entry 
from instruction array 303, and the results of comparators 
305, 306 are used to select which set (i.e., which half of the 
entry) contains the associated instruction. 

[0116] The above text describes the situation where the 
instruction sought is actually in the I-cache. Where there has 
been an I-cache miss, there are two possibilities: (a) there 
has been an ERAT hit, but the instruction is not in the 
instruction array; or (b) there has been an ERAT miss. In the 
case where there has been an ERAT hit, it is possible to fill 
the desired cache line significantly faster. Because the real 
page number is in the ERAT, the desired data is known to be 
in main memory (and possibly in an L2 cache). It is possible 
for logic in LI 1 -cache 106 to construct the full real address 
of the desired instruction from ERAT data, without access- 
ing external address translation mechanisms, and to fetch 
this data directly from L2 cache or memory. In the case 
where there has been an ERAT miss, an external address 
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translation mechanism must be accessed in order to con- 
struct the real address of the desired instruction, and to 
update the ERAT as necessary with the new real page 
number. It is possible that in this case, the desired data will 
not exist in main memory at all, and will have to be read in 
from secondary storage such as a disk drive. 

[0117] Further information concerning the operation of 
L-ll-cache 106 is contained in U.S. Pat. No. 6,161,166, 
entitled Instruction Cache for Multithreaded Processor, 
herein incorporated by reference. 

[0118] As described above, OR logic 422 performs a 
logical OR of address bits from the RMOR and the effective 
address to create an logically partitioned effective address 
which is offset from the effective address generated by 
instruction unit 201. The use of OR logic presumes that 
certain high order bits of the effective address are zeroes, 
otherwise the bits identifying the logical partition can be 
corrupted. These conditions and others are verified by 
address protection logic shown in FIG. 6. 

[0119] As shown in FIG. 6, the 2-bit real memory limit 
register (RMLR) 913 and effective address bits 24-35 (EA 
as) are input to partition size decode logic 601. The RMLR 
designates the size of the logical partitions, as follows: 



RMLR value: 0 0 Partition size: 64 GB 

10 4 GB 

1 1 256 MB 



[0120] Decode logic 601 outputs an address-out-of-range 
signal 604, a single bit value which is a logic '1' if the 
effective address runs outside the established partition size 
as specified in the RMLR. The logic function performed by 
decode logic 601 can be expressed as: 

AOR = EA 24 OR £425 OR EA 2 t> OR EA 27 OR (RMLR 0 AND £4 2 s) OR 
(RMLR 0 AND EA N ) OR (RMLR 0 AND £ , /A J0 )OR 
(RMLRo AND EA n ) OR («A#ZJf, AND £A 32 )OR 
(RMLRi AND £A 33 ) OR {RMLR t AND £A M )OR 
{RMLR { AND £A„) 



[0121] Decode logic 601 generates an AOR signal as 
described above for all effective addresses generated by 
instruction unit 201. However, the signal is significant only 
if certain conditions are met. Specifically, if the effective 
address is translated through the address translation mecha- 
nism shown in FIG. 8, then the AOR signal is ignored 
because there is no correspondence of high order effective 
address bits and real address bits, and logical partitioning 
code under the control of the operating system assures that 
values in the translation tables enforce logical partitioning. 
The AOR signal is also ignored if the thread for which the 
address is generated is in hypervisor state, since such a 
thread is authorized to access all logical partitions. Finally, 
the AOR signal is ignored if the LPES bit is 0 (indicating an 
"AS" system environment). 

[0122] The logic which performs these functions is shown 
in FIG. 6 as selectors 610, 611, and RS real address space 



check logic 602. Selector 610 selects either MSR(IR) signal 
922 or MSRDorm(IR) signal 932, depending on incoming 
signal from DTA (Dormant Thread Address access select) 
line 612. DTA line 612 is active when the effective address 
is generated on behalf of the dormant thread, i.e., in the case 
of a background fetch of the dormant thread's instructions. 
In all other cases, the DTA line is low, indicating that the 
address is generated on behalf of the active thread. Selector 

610 outputs on IR line 621 the MSRDorm(IR) signal if DTA 
line 612 is active, otherwise outputs the MSR(IR) signal. 
Selector 611 similarly selects either MSR(HV) signal 925 or 
MSRDorm(HV) signal 935, depending on DTA input 612. 
The output of selector 611, designated HV 620, is also used 
as input to multiplexer 421. The outputs of selectors 610 and 

611 can be logically expressed as follows: 

[0123] IR=(->DTA AND MSR(IR)) OR (DTA AND 
MSRDormpR)) 

[0124] HV=(iDTAAND MSR(HV)) OR (DTAAND 
MSRDorm(HV)) 

[0125] The output of RS real address space check logic 
602 can be expressed as follows: 

[0126] RS_check-LPES AND AOR AND -iHVAND 
-«IR 

[0127] Where an "AS" mode operating system is used, AS 
real address space check logic 603 will generate an AS check 
signal if there is an attempt to generate an address in E=R 
mode, while not in hypervisor state. In other words, when in 
"AS" mode, E=R addressing can only be used in hypervisor 
state. The output of AS real address space check logic 603 
can be expressed as follows: 

[0128] AS_check«-«LPES AND -.HV AND -IR 

[0129] As shown in FIG. 6, an instruction storage inter- 
rupt is generated if there is an AS check or if there is and RS 
check, i.e. 

[0130] LPAR ISI=>AS_check OR RS_check 

[0131] This is simply one set of possible conditions which 
may cause an interrupt. A protection exception signal 411 
(explained above) also causes an instruction storage inter- 
rupt, as do various other conditions. The effect of the 
instruction storage interrupt is that the generated address is 
not accessed by the processor, and appropriate interrupt 
routines are called. 

[0132] The above text and accompanying figures explain 
how addresses of instructions are verified and mapped to an 
address range corresponding to the logical partition of the 
processor which generated the address. Addresses of data are 
processed in a similar, although simplified, manner. Data 
addresses are processed using the logic depicted in FIGS. 5 
and 6. Much of the logic which processes data addresses is 
physically separate from the logic which processes instruc- 
tion addresses, although the two operate in a similar manner. 

[0133] Unlike instructions (which may be pre -fetched for 
either the active or dormant thread), only the active thread 
generates data addresses. Therefore some of the logic shown 
in FIG. 6, which is required to process pre-fetched instruc- 
tion addresses for a dormant thread, is not needed in the case 
of data addresses. Additionally, the LI data cache does not 
use an ERAT. 
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[0134] FIG. 5 depicts the real address partitioning mecha- 
nism for data addresses, analogous to the partitioning 
mechanism for instruction addresses shown in FIG. 4, 
Execution unit 211 generates a data address by any of 
various conventional means known in the art, e.g., as a value 
taken from a register, as a field of an instruction concat- 
enated or offset from a register value, as a computed value 
from multiple registers, etc. The effective address may or 
may not require translation. Where translation is indicated 
(E=R is false), address bits 0-51 are input to the translation 
mechanism (depicted at a high level in FIG. 8), which 
produces a translated 28-bit real page number. Where trans- 
lation is not indicated (E=R is true), a partitioned real 
address is produced from the effective address in a manner 
similar to that explained above for instruction addresses. I.e., 
EA^ 3S is bitwise ORed with the contents of real memory 
offset register 912 by OR logic 522. Multiplexer 521 selects 
EA 24 35 if either HV 620 or LPES 911 is true, otherwise 
selects the output of logic 522. The output of multiplexer 
521 is concatenated with EA 36 51 for input to multiplexer 
502. Multiplexer 502 chooses either translated 28-bit real 
page number from the translation mechanism, or the output 
of multiplexer 521, as the 28-bit real page number, i.e., real 
address bits 24 to 51. The 12-bit byte index within the real 
page is taken directly from EA 52 63 . 
[0135] As in the case of instruction addresses, separate 
logic circuitry for data addresses produces an error signal. 
This logic is similar to that shown in FIG. 6, but simplified. 
In particular, because data addresses are only generated on 
behalf of the currently active thread, selectors 610 and 611 
are not used in the logic which checks for LPAR address 
errors in data addresses. I.e., in the case of data addresses, 
HV=MSR(HV) and DR-MSR(DR). AS Real Address Space 
Check logic 603 is similarly simplified because only 
MSR(TA) and MSR(Pr) (and not MSRDorm(TA) and MSR- 
Dorm(Pr)) are used as input. 

[0136] LPID 914 is used as a tag in certain bus operations 
to identify the relevant logical partition, thus limiting the 
effect of the bus operation and improving efficiency. A 
processor receiving data in such an operation from a bus to 
which it is attached will compare the tag received on the bus 
(the logical partition ID to which the operation pertains) 
with its own logical partition ID stored in its configuration 
register 910. If the two are not identical, the operation is 
ignored by the processor. 

[0137] A simple example will demonstrate the potential 
performance improvement of this arrangement. ERAT301 is 
essentially a cache of some of the information contained in 
segment table 821 and page table 822, the segment and page 
tables being external to the processor. Each logical partition 
has its own segment and page tables, which are maintained 
independently of those in other logical partitions. Since a 
logical partition may contain multiple processors, activity in 
another processor may cause a page fault or other condition 
which alters the contents of one or the other of these tables. 
In that event, the corresponding ERAT entries may be 
affected. Therefore, whenever the segment table or page 
table are modified, an appropriate message will be broadcast 
to all processors on the bus, so that each may invalidate any 
affected ERAT entry. If, however, a processor is in a different 
logical partition, its ERAT is not affected by such a change. 
By comparing the LPID in the bus tag with the processor's 
own LPID in its configuration register, the processor knows 
immediately (e.g., at the bus interface 225, without access- 
ing ERAT 301) whether the bus message pertains to it, and 
can safely ignore any page table or segment table changes in 
for different logical partition. 



[0138] The ability of code in hypervisor state to alter the 
information in configuration register 910 means that the 
logical partitioning of a system can be dynamically changed. 
E.g., processors and other resources can be re-allocated to 
different logical partitions, the address ranges associated 
with a logical partition can be altered, or partitioning can be 
turned off entirely. Since only code executing in hypervisor 
state can alter these registers, the system is protected from 
accidental re-configuration by user code. 

[0139] Additional background information concerning an 
exemplary (although by no means the only possible) hyper- 
visor implementation can be found in commonly assigned 
copending U.S. patent application Ser. No. 09/314,214, filed 
May 19, 1999, entitled Logical Partition Manager and 
Method, herein incorporated by reference. 

[0140] It will be understood that certain logic circuitry not 
essential to an understanding of the present invention has 
been omitted from the drawings and description herein for 
clarity. For example, logic for maintaining the MRU bit in 
array 302, logic for detecting parity errors and taking 
appropriate corrective action, etc., have been omitted. 

[0141] In the preferred embodiment, a multithreaded pro- 
cessor employing coarse-grained hardware multithreading 
concepts is used. However, it will be understood that as 
alternative embodiments it would be possible to employ 
fine-grained multithreading operation, in which execution 
among the various threads is rotated on a cycle -by-cycle 
basis. It would also be possible to support logical partition- 
ing as described herein on a processor which does not have 
hardware multithreading support. 

[0142] While the invention has been described in connec- 
tion with what is currently considered the most practical and 
preferred embodiments, it is to be understood that the 
invention is not limited to the disclosed embodiments, but on 
the contrary, is intended to cover various modifications and 
equivalent arrangements included within the spirit and scope 
of the appended claims. 

What is claimed is: 

1. A computer processing apparatus, comprising; 

effective address generating logic, said effective address 
generating logic generating effective addresses to be 
accessed, wherein at least some of said effective 
addresses are real addresses; 

at least one state register for recording processor operat- 
ing parameters, said at least one state register including 
a mode designator, said mode designator designating an 
operating mode; 

real address range checking logic, said real address range 
checking logic generating an error signal responsive to 
detecting a real address outside a predetermined range 
when said real address is associated with a first oper- 
ating mode, and not generating an error signal respon- 
sive to detecting a real address outside said predeter- 
mined range when said real address is associated with 
a second operating mode. 

2. The computer processing apparatus of claim 1, further 
comprising: 

a plurality of sets of registers for supporting the execution 
of a plurality of threads, each set of registers corre- 
sponding to a respective one of said plurality of 
threads; 
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wherein said at least one state register for recording 
processor operating parameters comprises a respective 
mode designator associated with each thread, said 
mode designator designating, for each of said threads 
independently, a respective operating mode; and 

wherein said real address range checking logic generates 
an error signal responsive to detecting a real address 
outside said predetermined range when said real 
address is generated on behalf of a thread in said first 
operating mode, and does not generate an error signal 
responsive to detecting a real address outside said 
predetermined range when said real address is gener- 
ated on behalf of a thread in said second operating 
mode. 

3. The computer processing apparatus of claim 2, wherein 
said effective address generating logic comprises an instruc- 
tion unit for generating effective addresses of instructions 
for execution by said processor, said instruction unit gener- 
ating effective addresses on behalf of an active thread and on 
behalf of at least one dormant thread, said real address range 
checking logic selectively generating an error signal respon- 
sive to detecting a real address outside said predetermined 
range on behalf of an active thread and on behalf of at least 
one dormant thread responsive to said mode designator 
associated with the respective thread. 

4, The computer processing apparatus of claim 1, 

wherein at least some of said effective addresses gener- 
ated by said effective address generation logic are 
translatable addresses intended for translation from an 
effective address to a real address; and 

wherein said computer processing apparatus further com- 
prises real address selection logic for selectively out- 
putting a real address translated from said effective 
address when said effective address is a translatable 
address, and outputting a real address generated by said 
real address partitioning logic when said effective 
address is a base real address. 

5, The computer processing apparatus of claim 4, further 
comprising: 

an effective -to-real address translation table, wherein at 
least some of said effective addresses which are trans- 
latable addresses are translated from an effective 
address to a real address by reference to said effective - 
to -real address translation table. 

6. The computer processing apparatus of claim 1, wherein 
said real address range checking logic further comprises: 

a real memory limit register specifying a range of address 
bits of said effective address; 

AND logic performing a plurality of logical ANDs of 
each of a plurality of address bits derived from said 
effective address generated by effective address gen- 
eration logic with a respective bit from a mask gener- 
ated from a value in said real memory limit register to 
produce a masked portion of said effective address; and 

OR logic for performing a logical OR of a plurality of 
address bits derived from said effective address gener- 
ated by said effective address generation logic, said 
plurality of address bits derived from said effective 
address including said masked portion of said effective 
address generated by said effective address generation 
logic. 



7. The computer processing apparatus of claim 1, wherein 
said mode designator is placed in said second operating 
mode only upon occurrence of one of a set of predefined 
events. 

8. The computer processing apparatus of claim 7, wherein 
each said predefined event of said set of predefined events 
causes said computer processing apparatus to branch to a 
respected predefined real memory address. 

9. The computer processing apparatus of claim 7, wherein 
a state represented by said at least one state register is saved 
in a saved state register upon occurrence of one of said set 
of predefined events, and restored to said at least one state 
register upon return from processing said one of said set of 
predefined events. 

10. A computer system, comprising: 

a plurality of processors; 

a main memory addressable using a real address; 

a logical partitioning mechanism capable of partitioning 
said computer system into a plurality of logical parti- 
tions, each processor of said plurality of processors 
being assigned by said logical partitioning mechanism 
to a respective one of said logical partitions, each 
logical partition being assigned a corresponding 
respective discrete range of real addresses of said main 
memory; 

wherein each of said plurality of processors comprises: 

a mode designator, said mode designator designating an 
operating mode for a thread executing in said pro- 
cessor; and 

real address range checking logic, said real address 
range checking logic generating an error signal 
responsive to detecting a real address outside the 
corresponding respective discrete range of real 
addresses assigned to the logical partition to which 
the respective processor is assigned when said real 
address is generated on behalf of a thread executing 
in a first operating mode, and not generating an error 
signal responsive to detecting a real address outside 
said corresponding respective discrete range of real 
addresses assigned to the logical partition to which 
the respective processor is assigned when said real 
address is generated on behalf of a thread executing 
in a second operating mode, 

11. The computer system claim 10, wherein each of said 
plurality of processors supports the execution of a plurality 
of threads and further comprises: 

a plurality of mode designators, each mode designator 
corresponding to a respective one of said plurality of 
threads and designating, for each of said threads inde- 
pendendy, a respective operating mode; 

wherein said real address range checking logic generates 
an error signal responsive to detecting a real address 
outside said corresponding respective discrete range of 
real addresses assigned to the logical partition to which 
the respective processor is assigned when said real 
address is generated on behalf of a thread in said first 
operating mode, and does not generate an error signal 
responsive to detecting a real address outside said 
corresponding respective discrete range of real 
addresses assigned to the logical partition to which the 
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respective processor is assigned when said real address 
is generated on behalf of a thread in said second 
operating mode. 

12. The computer system of claim 10, wherein said real 
address range checking logic in each said processor further 
comprises: 

a real memory limit register specifying a range of address 
bits of an effective address generated by effective 
address generation logic in said processor; 

AND logic performing a plurality of logical ANDs of 
each of a plurality of address bits derived from said 
effective address generated by said effective address 
generation logic with a respective bit from a mask 
generated from a value in said real memory limit 
register to produce a masked portion of said effective 
address; and 

OR logic for performing a logical OR of a plurality of 
address bits derived from said effective address gener- 
ated by said effective address generation logic, said 
plurality of address bits derived from said effective 
address including said masked portion of said effective 
address generated by said effective address generation 
logic. 

13. The computer system of claim 10, 

wherein each of said plurality of processors generates 
effective addresses on behalf of threads executing in the 
processor, wherein at least some of said effective 
addresses are translatable addresses intended for trans- 
lation from an effective address to a real address; and 

wherein each of said plurality of processors further com- 
prises real address selection logic for selectively out- 
putting a real address translated from said effective 
address when said effective address is a translatable 
address, and outputting a real address generated by said 
real address partitioning logic when said effective 
address is a base real address. 

14. The computer system of claim 13, 

wherein each of said plurality of processors further com- 
prises an effective-to-real address translation table, 
wherein at least some of said effective addresses which 
are translatable addresses are translated from an effec- 
tive address to a real address by reference to said 
effective -to -real address translation table. 

15. A computer processing apparatus for supporting a 
computer system divisible into a plurality of logical parti- 
tions, each partition having a respective portion of the real 
address space of said computer system, said computer 
processing apparatus comprising: 

means for maintaining state information for at least one 
thread, said state information including an operating 
mode; 

means for generating effective addresses for said at least 
one thread, wherein at least some of said effective 
addresses are base real addresses not intended for 
translation; and 

means for detecting a real address outside a predeter- 
mined range and generating an error signal responsive 
thereto when said real address is generated for a thread 
executing in a first operating mode, and not generating 



an error signal when said real address is generated for 
a thread executing in a second operating mode. 

16. The computer processing apparatus of claim 14, 

wherein said means for maintaining state information 
maintains state information for each of a plurality of 
threads independently; and 

wherein said means for detecting a real address outside a 
predetermined range generates an error signal respon- 
sive to state information for the thread for which the 
real address is generated. 

17. A computer processing apparatus, comprising: 

a configuration register for recording configuration infor- 
mation, said configuration information including a pro- 
cessor logical partition identifier; 

at least one state register for recording processor operat- 
ing parameters, said at least one state register including 
a mode designator, said mode designator designating an 
operating mode; 

execution logic for executing instructions, said instruc- 
tions including at least one instruction for altering said 
processor logical partition identifier, wherein said 
execution logic executes said at least one instruction for 
altering said processor logical partition identifier when 
in a first operating mode, and does not execute said at 
least one instruction for altering said processor logical 
partition identifier when in a second operating mode; 
and 

bus interface logic, said bus interface logic receiving 
communications on a bus, at least some of said com- 
munications including a respective bus logical partition 
identifier; 

wherein said processing apparatus ignores a communica- 
tion including a bus logical partition identifier if said 
bus logical partition identifier does not match said 
processor logical partition identifier. 

18. The computer processing apparatus of claim 15, 
further comprising: 

a plurality of sets of registers for supporting the execution 
of a plurality of threads, each set of registers corre- 
sponding to a respective one of said plurality of 
threads; 

wherein said at least one state register for recording 
processor operating parameters comprises a respective 
mode designator associated with each thread, said 
mode designator designating, for each of said threads 
independently, a respective operating mode; and 

wherein said execution logic executes said at least one 
instruction for altering said processor logical partition 
identifier when said at least one instruction for altering 
said processor logical partition identifier is part of a 
thread executing in said first operating mode, and does 
not execute said at least one instruction for altering said 
processor logical partition identifier when said at least 
one instruction for altering said processor logical par- 
tition identifier is part of a thread executing in said 
second operating mode. 
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SHARED MEMORY MULTIPROCESSOR 
SYSTEM AND METHOD WITH ADDRESS 
TRANSLATION BETWEEN PARTITIONS 
AND RESETTING OF NODES INCLUDED IN 
OTHER PARTITIONS 

CROSS-REFERENCE TO RELATED 
APPLICATION 

The present application relates to subject matter described 
in application Ser. No. 09/030,957 filed Feb. 26, 1998 
entitled "Shared Memory Multiprocessor", now U.S. Pat. 
No. 6,088,770, the disclosure of which is hereby incorpo- 
rated by reference. 

BACKGROUND OF THE INVENTION 

The present invention relates to a shared memory parallel 
processor system used in information processing 
apparatuses, especially used in personal computers (PCs), 
work stations (WSs), and server machines. In particular, the 
present invention relates to a control scheme of a shared 
memory between partitions. 

In recent years, use of the architecture of the shared 
memory multiprocessor as a host module of the parallel 
processors has spread. In this architecture, such a configu- 
ration where several tens to several hundreds processors 
share a main memory is needed in some cases in order to 
improve the performance. As the configuration method of 
the shared memory multiprocessor, bus connection sym- 
metrical multiprocessors (SMPs) used in personal comput- 
ers are typical. Since the bus throughput forms a bottleneck 
in the bus connection SMPs, however, the number of pro- 
cessors which can be connected is limited to approximately 
four. Thus the bus connection SMP is not suitable for such 
a scheme as to connect a large number of processors. 

In order to solve the above described problem, there has 
been proposed a method of connection bus connecting SMPs 
hierarchically by using a crossbar switch or the like. A 
typical example of the hierarchical SMP is found in 
"Gigaplane — XB: Extending the Ultra Enterprise Family", 
HOT Interconnects V, pp. 97 to 112, August 1997. The 
crossbar switch or the like between nodes logically functions 
as a bus. Coherence of a CPU cache between nodes of the 
bus connection SMP having processors and main memories 
can be managed at high speed by using a bus snoop protocol. 

As one of problems of the large scale shared memory 
multiprocessor as described above, there is reliability. In 
conventional shared memory multiprocessors, the whole 
system has one Operation System (OS). Since all processors 
of the system can be managed by one OS, this scheme has 
an advantage that flexible system operation (such as load 
distribution) can be conducted. However, this scheme has a 
drawback that the system reliability falls in the case where 
a large number of processors are connected in a shared 
memory multiprocessor configuration. In a server of a 
cluster configuration in which a plurality of processors are 
connected by a network, respective nodes have different 
OSs. Even if a fatal error such as a bug or the like of an OS 
or the like occurs, only the corresponding node suffers from 
a system down state. On the other hand, if a certain proces- 
sor is brought into a down state by a system bug or the like 
in the case where the whole system is controlled by one OS 
in a shared memory multiprocessor, the OS is brought into 
a down state and consequently all processors are affected. 

In order to solve this problem, there has been proposed 
such a scheme that the inside of a shared memory multi- 
processor is divided into a plurality of partitions and a 
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plurality of OSs are run independently. Each partition has an 
independent main memory. A processor of a certain partition 
basically accesses only the main memory of its own parti- 
tion. As a result, it becomes possible to realize the fault 
5 containment between partitions and improve the system 
performance. 

Furthermore, also for improving the operation perfor- 
mance and reducing the management cost using server 
consolidation, it is desired to integrate works which have 
been conducted by a plurality of servers into one highly 

io multiplexed server. The above described partition technique 
is indispensable. 

In the case where a shared memory multiprocessor is 
divided into partitions, how communication is conducted 
between partitions poses a problem. A scheme in which 

15 communication between partitions is conducted by making 
efficient use of a shared memory mechanism provided in a 
system before partitioning is at advantage in performance. 
Therefore, realization of a shared memory between parti- 
tions becomes necessary. 
A partition technique of making a plurality of OSs run in 

20 one system has been used heretofore in mainframes, and it 
has been disclosed in U.S. Pat. No. 4,843,541. In this 
scheme, it is possible to make a plurality of guest OSs 
operate under the management of a host OS which manages 
the whole system. Respective guest OSs are independent 

25 systems having different address spaces. Access to a main 
memory in each partition is conducted according to the 
following procedure. 

(1) A virtual address of a guest is translated to a real 
address. 

30 (2) The above described guest real address is translated to 
a main memory address in the host. 
(3) The main memory is accessed by using the main 

memory address in the host derived in (2). 
The above described address translation of the two stages 
35 must be conducted between a CPU and the main memory. 
In the partitions of the main frame, it is made possible for 
respective guest partitions to have different address spaces 
and the fault containment is realized by conducting the 
above described address translation of the two stages. By 
40 overlapping addresses of guests in the address translation of 
(3), the shared memory can be realized. 

In realizing a partition mechanism and an inter-partition 
shared memory of a hierarchical bus connection SMP by 
using the above described conventional techniques, there are 
problems described hereafter. 
45 The conventional inter-partition shared memory mecha- 
nism is premised on a concentrated main memory architec- 
ture having an address translation mechanism of two stages 
between each CPU and the main memory. Therefore, the 
conventional inter-partition shared memory mechanism is 
50 largely different in architecture from the hierarchical bus 
connection SMP Accordingly, the conventional technique 
cannot be applied to the hierarchical bus connection SMP as 
it is. In particular, respective CPUS use standard compo- 
nents. As a result, the address translation of the two stages 
55 used in the conventional technique cannot be conducted in 
the CPU, and relocation of the address of each partition 
(guest) cannot be conducted. 

Furthermore, in the hierarchical SMP, the CPU cache 
coherence is kept at a high speed by using the bus snoop 
6Q protocol. Therefore, the inter-partition shared memory 
mechanism needs to be capable of supporting the bus snoop 
protocol. 

SUMMARY OF THE INVENTION 

Therefore, an object of the present invention is to realize 
65 a partition mechanism and an inter-partition shared memory 
mechanism suitable for the architecture of the hierarchical 
SMP. 
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In addition, a future parallel system must support a 
general purpose OS. Accordingly, the partition system needs 
to have a general purpose architecture which does not 
depend on a specific OS. It is necessary to make it possible 
for each partition to have a free address space. In addition, 
it is necessary to realize dynamic generation and erasing of 
a partition in order to deal with a large number of applica- 
tions and improve the rehab ility of the system by using 
dynamic reconfiguration of partitions. 

Another object of the present invention is to flexibly 
manage the configuration of the inter-partition shared 
memory. 

In addition, the partition system needs to realize high 
reliability at a low cost. Thus it is indispensable for partitions 
to back up each other. Therefore, a third object of the present 
invention is to facilitate recovery from an error from another 
partition in the case where the OS of a certain partition 
suffers from system down. 

In order to achieve the above described first and second 
objects, such a hierarchical SMP that nodes each having 
CPUs coupled by a bus and a main memory are connected 
by a switch and cache coherence control is conducted 
through the switch, at the gateway of the switch from each 
node when the inside of the system is divided into partitions 
in each of which a different OS operates, with means for 
mutually translating an address of an access command for an 
area shared between partitions, between a real address used 
in a partition and an address used in common between 
partitions. As a result, the address of a local area of each 
partition is freely set. In addition, cache coherence control of 
the shared area can be conducted at high speed by using a 
snoop command of the hierarchical SMP. 

Furthermore, in another preferred aspect of the present 
invention, conformity between the address of the access 
command issued from another partition and the configura- 
tion of the shared area is checked at the gateway of each 
node. As a result, fault containment can be realized between 
partitions. 

Furthermore, in another preferred aspect of the present 
invention, there is provided apparatus for the system soft- 
ware to dynamically modify the configuration information 
of the shared area between partitions. As a result, flexible 
management of the shared area becomes possible. 

In addition, in order to achieve the above described third 
object, each partition is provided with a function of resetting 
CPUs of other partitions. In the case where a certain 
partition suffers from system down, it is possible to reset and 
re-initialize the partitions which have suffered from system 
down. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 shows a shared memory multiprocessor having a 
partition mechanism according to the present invention; 

FIG. 2 shows an example of partitioning in a shared 
memory multiprocessor according to the present invention; 

FIG. 3 shows an example of address spaces of respective 
partitions in a shared memory multiprocessor according to 
the present invention; 

FIG. 4 shows a summary table of relations between access 
addresses and operations of address translation circuits of 
respective nodes; 

FIG. 5 shows the configuration of an address translation 
table of each node; 

FIG. 6 shows a detailed diagram of an output side 
translator of each node; 
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FIG. 7 shows the configuration of partition configuration 
information of respective nodes; 

FIG. 8 shows the configuration of main memory configu- 
ration information in partition of respective nodes; 
5 FIG. 9 shows a detailed diagram of a shared area detect 
circuit of each node; 

FIG 10 shows a table describing relations between access 
commands and a destination indicate and global bit addition 
circuit of respective nodes; 

FIG. 11 shows a detailed diagram of a shared area detect 
and address check circuit of each node; 

FIG. 12 shows a flow diagram showing a process con- 
ducted when a home partition ensures a sa in memory in a 
15 multiprocessor system of the present invention; 

FIG. 13 shows a flow diagram showing a process con- 
ducted on an added partition side when a shared partition is 
added in a multiprocessor system of the present invention; 

FIG. 14 shows a flow diagram showing a process con- 
20 ducted on a home partition side when a shared partition is 
added in a multiprocessor system of the present invention; 

FIG. 15 shows a flow diagram showing a process con- 
ducted on a home partition side when a shared area is erased 
in a multiprocessor system of the present invention; 
25 FIG. 16 shows a flow diagram showing a process con- 
ducted on a side of a partition which imports the shared area 
when a shared area is erased in a multiprocessor system of 
the present invention; and 

FIG. 17 shows the details of a reset circuit in a multipro- 
cessor system of the present invention. 

DESCRIPTION OF THE EMBODIMENTS 

FIG. 1 is a block diagram of a shared memory multipro- 

35 cessor according to the present invention. The present sys- 
tem is an 8-node system. A plurality of nodes 100 to 800 
(which are sometimes referred to as nodes 0 to 7) are 
connected by a bus or internode connection switch 900. Six 
nodes other than the nodes 100 and 800 are omitted in FIG. 

4Q 1 in order to avoid the complexity. However, every node has 
the same structure. By taking the node 0 as a representative 
example, the structure will be described. The node has a 
main memory 180 and four processors, i.e., Central Proces- 
sor Units (CPUs) 110, 111, 112 and 113. These CPUs are 

45 coupled to a node controller 120 via a CPU bus 190. The 
node controller 120 includes a main memory access circuit 
130, an address translation circuit 140, partition configura- 
tion information 160, and main memory configuration infor- 
mation partition 165. The main memory 180 forms a part of 

5Q a main memory common to this system, and holds a part of 
a program executed by each node and data. This system is 
the so-called parallel processor system of distributed shared 
memory type. 

The cache coherence between CPUs in the node is man- 
55 aged by a bus snoop protocol. As the bus snoop technique, 
a known technique is employed. In FIG. 1, the CPUS in the 
node are connected by the bus. As for the hardware, 
however, a coupling scheme other than the bus, such as 
one-to-one coupling or coupling using a switch, may also be 
60 used. As for these connection methods in the nodes, various 
known techniques can be applied. 

The cache coherence between the CPUs of the nodes is 
also managed by using the bus snoop protocol. The inter- 
node connection switch 900 logically functions in the same 
65 way as the bus. 

The address translation circuit 140 is hardware for trans- 
lation between an address in the node (an address used by 
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the CPUs 110 to 113, the CPU bus 190, and the main are included in the node. In addition to the above described 

memory 180) and an address used by the inter-node con- real address spaces of respective partitions, the system has 

nection switch 900 (an address outside the node). a shared memory space (shown on the right hand side of 

Within the address translation circuit 140, the side for FIG. 3) common to the whole system. An area shared by 

outputting a command from the inside of the node includes 5 partitions have different addresses within the partitions, but 

a shared area detect circuit 141 for determining whether the has a common address in the shared memory space, 

accessed area is a shared area, an output side translator 144 Access from each partition to a shared area is conveyed to 

for translating the address of the shared area, and a desti- another partition via a shared memory space address. In 

nation indicate and global bit addition circuit 142 for send- other words, an access command such as a cache coherent 

ing out a command to proper destination. The side for 10 check command (CCC command) issued to a shared area by 

inputting a command from the outside of the node includes each node is transformed from the real address of its own 

a shared area detect and address check circuit 143 for partition to a shared memory space address when the access 

determining whether a command from another node is a command is outputted from the accessing node. When the 

command directed to a shared area and checking the address, access command is inputted to an accessed node, the access 

an input side translator 145 for translating the address of a 15 command is transformed from the shared memory space 

shared area, a DND acknowledge circuit 146 for issuing an address to a real address of an accessed partition. These 

acknowledge in response to an access request for a shared address transformation operations are conducted by the 

area which is not shared by its own node, and an acknowl- address translation circuit 140 located at the gateway of each 

edge wait circuit 147 for waiting an acknowledge of a snoop node. Owing to this address translation function, it is pos- 

command from another node. These circuits performs opera- 20 sible in each partition to freely determine a real address 

tions specific to the present invention. space which exceeds the range of the physical main memory 

Each node has an inter-OS communication function 170 of a node its own partition, 

to be used for system initialization process, configuration It should be noted that in the case where the bus snoop 

control, debug and so on. Without using the shared memory, protocol is used there is a possibility that a CCC command 

system software of each node can perform communication. 25 will be issued by an arbitrary CPU which shares a certain 

Physically, an inter-OS communication path 910 can also cache line and the CCC command might be broadcasted in 

share the same hardware as the inter-node connection some cases to all CPUS which may possibly share the 

switch. Alternatively, the inter-OS communication path 910 pertinent line. Therefore, address translation conducted in 

can also use a typical network such as a LAN. each partition between the local real address of a partition 

FIG. 2 shows a configuration example of partitioning in a 30 and the shared space address must be translatable in both 

hierarchical SMP of the present invention. Partitioning is directions. Furthermore, also when accessing a shared area 

conducted by taking a SMP node as the unit, and an arbitrary which has a physical main memory in the same partition 

combination of nodes can be adopted. In the example shown (such as when one of the nodes 0 to 3 accesses a shared area 

in FIG. 2, the inside of the system is divided into three A), an address in the CCC command must be translated to 

partitions. A first partition 1000 includes nodes 0 to 3 (100, an address of the shared memory space once in order to 

200, 300 and 400). A second partition 1001 includes nodes correctly convey the CCC command to nodes located out- 

4 and 5 (500 and 600). A third partition 1002 includes nodes side the partition as well. For access to a not-shared area (an 

6 and 7 (700 and 800). Respective partitions are independent area which is accessed by only nodes included in the 

systems in that different OSs operate. Respective partitions partition), address translation is not conducted at all (the 

have independent address spaces as hereafter described. In CCC command is sent to only the nodes included in the 

addition, partitions have a shared memory which is a part of partition). At that time, it is impossible to determine on the 

the main memory and which is shared by partitions. basis of the address alone whether a command inputted from 

In the case where there is not a memory shared by the switch is access to a not-shared space or access to a space 

partitions, exchange of data between partitions must be 45 shared by partitions. Therefore, each ^command in the switch 

conducted by using a message communication mechanism has a S lobal blt whereby it can be determined whether the 

or a typical network (LAN). Since the latency of message command is access to a shared space, 

communication requires approximately at least 10 ^s, it is In accordance with a feature of the present invention, the 

difficult to realize high speed communication. If a hierar- address translation circuit 140 is provided in the gateway of 

chical bus connection shared memory mechanism is utilized, 50 each node to the switch, and mutual translation between the 

however, communication between nodes can be realized in address of a shared area in each node and its shared memory 

several hundreds ns or less. Therefore, the inter-partition space address is conducted. 

shared memory is indispensable for realizing faster commu- Hereafter, an address management scheme between par- 

nication between partitions, faster resource lock processing titions will be described in detail by referring to the example 

between partitions, and faster fail over between partitions. 55 shown in FIG. 3. In FIG. 3, the address space of the first 

FIG. 3 shows an example of address maps of respective partition 1000 has local main memories 10, 20, 30 and 40 

partitions in the case where a shared memory mechanism of respectively of the nodes 0 to 3. The address space of the 

partitions is used. This example corresponds to the parti- second partition 1001 has local main memories 50 and 60 

tioning exemplified in FIG. 2. respectively of the nodes 4 and 5 and a shared area la. The 

Respective partitions (partitions 0 to 2) have real address 60 address space of the third partition 1002 has local main 

spaces which are independent from partition to partition (as memories 70 and 80 respectively of the nodes 6 and 7 and 

shown on the left hand side of FIG. 3). Respective nodes in shared areas lb and 2a. All of the nodes included in each 

the same partition have the same real address space. The real partition have the same address space. For example, both the 

address of the above described partition is the address used nodes 4 and 5 can access the areas 50, 60 and l& 

in the CPU bus 190 in nodes included in the partition. It is 65 As for shared areas, there are two shared areas A and B. 

a real address to be used by the CPUs 110 to 113, the main The shared area A is allocated onto the local main memory 

memory 180 and the main memory access circuit 130 which of the node 0, and the shared area B is allocated onto the 
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local main memory of the node 4. They are represented as 
shared area A (physical main memory) 1 and shared area B 
(physical main memory) 2, respectively. Partitions having 
main memories corresponding to these shared areas are 
hereafter referred to as partitions of export side. On the other 
hand, in the shared memory space, the shared area A and the 
shared area B are located in lz and 2z } respectively. 
Therefore, the partition 0 maps (exports) the shared area A 
from the real address space 1 in the partition to the shared 
memory space lz, and the partition 1 maps (exports) the 
shared area B from the real address space 2 in the partition 
to the shared memory space 2z. 

Furthermore, the shared area A is shared by the partition 
1, the partition 2 and the partition 3. The shared area B is 
shared by the partition 2 and the partition 3. Like the 
partitions 2 and 3 for the shared area A and the partition 3 
for the shared area B, a partition which has not a main 
memory of the pertinent area therein, i.e., a partition which 
accesses a physical main memory of the shared area 
included in another partition, is referred to as partition of 
import side. In the partition of the import side, an area 
(window) for accessing the main memory area included in 
another partition is formed in a portion of its own partition 
where a main memory of a real address is not mounted. 
Therefore, the partition 1 maps (imports) the shared area A 
from the shared memory space lz to the real address space 
la in the partition. The partition 2 maps (imports) the shared 
area A from the shared memory space lz to the real address 
space lb in the partition, and maps (imports) the shared area 
B from the shared memory space 2zto the real address space 
2a in the partition. 

The foregoing description is summarized as follows. In 
the example of FIG. 3, the following (bidirectional) address 
translation operations are required at gateways of respective 
nodes. 

The shared area A is exported from the main memory 1 in 
the node 0 to the shared memory space lz. 
Nodes 4 and 5 (the Second Partition 1001) 

The shared area B is exported from the main memory 2 in 
the node 4 to the shared memory space 2z. 

The shared area A is imported from the shared memory 
space lz to the area la in its own partition. 
Nodes 6 and 7 (the third partition 1002) 

The shared area A is imported from the shared memory 
space lz to the area lb in its own partition. 

The shared area B is imported from the shared memory 
space 2z to the area 2a in its own partition. 

The address of each of not-shared areas which are not 
mentioned in the foregoing description is not translated even 
when a CCC command is outputted from a node. 

FIG. 4 collectively shows address translation functions 
required at respective gateways of respective nodes in the 
present invention. Real addresses are divided into two 
classes, i.e., not-shared address (area which can be accessed 
from only the inside of a partition) and shared address 
(address shared by partitions). In addition, the shared 
addresses are divided into two classes, i.e., the case where 
there is a main memory in its own partition, and the case 
where there is not a main memory in its own partition. 

(1) Not-shared Area 

As for the not-shared area which is accessed from only the 
inside of the partition, the address in partition is used in the 
switch as well. Therefore, address translation at the gateway 
of the node is not conducted. The snoop command is 
multicast only to nodes in the partition. 

(2) Shared Area (Export Side) 

As for a shared area having a physical main memory in 
the partition, translation between the address of the main 
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memory in the own partition and the address of the shared 
memory space in the switch is required. The snoop com- 
mand is sent to all nodes. Such optimization that the snoop 
command is sent to only nodes of sharing partitions is also 

s possible. 

(3) Shared Area (Import Side) 

As for a shared area having no physical main memory in 
the partition, translation between the address of a window 
for accessing a shared area in the own partition and the 

10 address of a shared memory space in the switch is required. 
The snoop command is sent to all nodes. In this case as well, 
such optimization that the snoop command is sent to only 
nodes of sharing partitions is also possible. 

Furthermore, in order to realize the fault containment 

15 between partitions, unauthorized access between partitions 
is checked. In ordinary highly multiplexed SMP, access 
authority check is conducted at the time of address transla- 
tion when the CPU translates a virtual address to a real 
address. Assuming that the system software of an OS or the 

20 like of another partition runs away, however, access author- 
ity check on the CPU side which issues the access is not 
enough, but it is necessary to conduct access authority check 
in the accessed partition. Furthermore, since the cache 
coherence management using the bus snoop protocol is 

25 conducted between CPUs, not only data in the main memory 
but also data cached in the cache must be protected from 
unauthorized access. In order to realize the above described 
check, the following check is conducted at the entrance of 
each node on a CCC command which has arrived from 

30 another node. 

(1) Not-shared Area 

Commands from nodes located outside the partition are 
not authorized. 

(2) Shared Area (Both the Export Side and the Import Side) 
35 In the address translation hardware, there is provided 

means for storing nodes authorized to share each shared area 
(i.e., nodes sharing the area) in a bit map form. Commands 
from nodes which are not authorized to share the shared area 
are not accepted. 

40 The address translation heretofore described is dynami- 
cally set by system software such as an OS or middle 
software when the shared memory is allocated. The scheme 
of setting will be described later. 

Hereafter, the configuration of a shared memory multi- 

45 processor which realizes the inter-partition shared memory 
mechanism and the inter-partition address translation here- 
tofore described will be described in detail. 

With reference to FIG, 1, the address translation circuit 
140 is disposed at the gateway of each node, to be concrete, 

50 between the main memory access circuit 130 and the inter- 
node connection switch 900. The address translation circuit 
140 is a circuit for conducting address translation inside and 
outside the node, destination specification of a CCC com- 
mand sent to the outside of the node, and error check for a 

55 CCC command sent from the outside of the node, on the 
basis of information written in the address translation table 
150, the partition configuration information 160, and the 
main memory configuration information in partition 165. 
The address translation table 150, the partition configu- 

60 ration information 160, and the main memory configuration 
information in partition 165 have been subjected to memory 
mapping. The processors in the partitions can access them. 

FIG. 5 shows the format of an entry of the address 
translation table 150. In order to conduct address translation 

65 on the shared area, the address translation table has one entry 
shown in FIG. 5 per shared area. The address translation 
circuit conducts full- associative bidirectional address trans - 
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lation. Therefore, the number of shared areas exported and 
imported by a certain node is limited by the number of 
entries of the address translation table. As a result, the 
number of entries of the address translation table must be 
sufficiently large. Each entry has the following information. 5 
"Enable bit" (E) 151 

The entry is made valid. 
"Address in partition" 152 
"Address in shared memory space" 153 

(These two addresses are respective start addresses.) 10 
"Size" 154 

Size of shared space 
"Home node number" 155 

Number of a node having the pertinent shared area in the ]5 
main memory 
"Nodes to be authorized to share" 156 

A list of node numbers authorized to share the pertinent 
area (stored in a bit map form). 

(Whether nodes are authorized to share is determined by 20 
taking a partition as the unit. With due regard to the 
processing efficiency, storage is conducted in the present 
table by taking a node number as the unit.) 

Here, the shared memory is handled every certain fixed 
unit (such as 1 MB). In that case, fields 152 to 154 become 25 
multiples of 1 MB, Therefore, twenty low-order bits of the 
fields 152 to 154 are fixed to 0. Bits of the table are not 
mounted. 

FIG. 7 shows details of the partition configuration infor- 
mation 160. The partition configuration information 160 30 
shows how the system is partitioned. The partition configu- 
ration information 160 is formed of a bit map 161 showing 
which nodes are included in which partition, and a flag 162 
showing which partition its own node is housed. The number 
of entries of the table is the total number of nodes of the 35 
system. Therefore, up to the case of one node/one partition 
can be supported. FIG. 7 shows values of the partition 
configuration information 160 in the nodes 0 to 3 illustrated 
in FIG. 2. In other words, in this example, the partition 0 is 
its own partition, and the partitions 3 to 7 are not used. In 40 
other nodes as well, the same information is stored except 
the local partition flag 162. 

FIG. 8 shows details of the main memory configuration 
information in partition 165. The main memory configura- 
tion information in partition indicates the address range the 45 
local main memory of each of the nodes included in the own 
partition takes charge of. The main memory configuration 
information in partition does not have information of other 
partitions. The number of entries is the total number of nodes 
of the system. Up to the case where the whole system is 50 
formed of one partition can be supported. The main memory 
configuration information in partition includes a start 
address 166 and an end address 167 of the main memory 
each node takes charge of, and a V bit 168 indicating that the 
entry is valid. 55 

Commands exchanged by the inter-node connection 
switch 900 will now be described. In the hierarchical SMP, 
data reading/writing (cache coherent check: CCC) of each 
CPU is managed by the bus snoop protocol. As for the bus 
protocol in the node, a known technique is adopted. As for 60 
the CCC between nodes as well, a known technique is used. 
Here, as an example of the CCC between nodes, outline of 
a procedure of the bus snoop using a fetch command 
between nodes will now be described. For the purpose of 
description, it is now assumed that partitioning is not con- 65 
ducted. In other words, it is now assumed that the whole 
system is formed of one partition. 
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(1) In the case where a certain CPU reads data, the fetch 
command is issued to all other nodes (all CPUs and the 
main memory of the home node). 

(2) If there is modified data (latest data) in the cache in the 
own node, the node which has received the fetch 
command returns that data. Otherwise, the node which 
has received the fetch command replies that there are 
no modified data. If there are no modified data in the 
cache, the home node returns data in the main memory. 

(3) In the accessing node, replies from other nodes are 
totalized. If modified data in the cache is sent from 
another node, the data is returned to the accessing CPU. 
If every node does not have modified data, data 
returned from the main memory (data of the main 
memory in the own node in the case where the own 
node is the home node) is returned to the accessing 
CPU. For determining whether data returned from the 
main memory should be used, it is necessary to wait the 
reply "there are no data" from all nodes. 

Hereafter, CCC commands used on the bus within the 
nodes of the hierarchical SMP will be described. Characters 
enclosed in parentheses are abbreviations used in the 
embodiment. 
Fetch (F) 

It requests line transfer of data. 

It is issued in the case where a read command of the CPU 
has failed. 
Fetch&Invalidate (Fl) 

It requests invalidation of data on other caches simulta- 
neously with the line transfer of data. 
It is issued in the case where a read command of the CPU 
has failed. 
Invalidate (I) 

It requests invalidation of data on other caches. It is issued 
in the case where the CPU has issued a write request to 
a cache line shared with other caches. 
WriteBack (WB) 

It requests writeback of a cache line. 
It is issued when data has been turned out by replacement. 
Data (D) 

It is a command for returning data in response to the F or 
FI command in the case where there is modified (latest) 
data in the cache. 
DataMem (DM) 

It is a command for returning data in the main memory of 
the home node in response to the F or FI command. In the 
home node, it is returned in the case where there are no 
modified (latest) data in the cache within the node. In the 
case where a D command has come from any other node, 
data returned by the DM command is ignored. (Data in the 
cache is given priority.) 

It is also used for the reply to UncachedRead. 
NoData (ND) 

It is returned in response to the F or FI command in the 
case where there are no modified (latest) data in the 
cache of the pertinent node (except the home node). 
DummyNoData (DND) 

A reply returned from a node located outside the partition 
in response to the F or FI command. 

(DND is not an ordinary CCC command, but it is a 
command required in implementation specific to the 
present embodiment. Its meaning is the same as that of 
ND.) 



09/22/2003, EAST Version: 1.04,0000 



US 6,51 

11 

UacachedRead (UR) 
UncachedWrite (UW) 

It is a command for directly accessing the main memory 
in cache off. 

A command in the inter-node connection switch has the 
following fields specific to the present invention besides 
fields required for CCC, such as command, address, and 
data. 

(1) Destination Node Number 

The destination node number is represented by a bit map. 
By virtue of representation of destination using a bit map, it 
is possible to easily realize multicast directed to a plurality 
of specific nodes, such as nodes in the partition, by setting 
a plurality of bits and further broadcast directed to all 
processors of the system by setting all bits. 

(2) Global Bit 

It is a bit for determining whether the pertinent access 
command is a command for a shared area (hence the address 
is an address of the shared memory space) or a command for 
a not -shared area (hence the address is a local address in the 
partition). 

Hereafter, operations of the address translation circuit in 
the case where a command is issued to another node, and in 
the case where a command has been received from another 
node will be described separately for respective cases and 
successively. 

(A) Processing in the Case where a Command is Issued from 
a Node 

If a CCC command is delivered from the main memory 
access circuit 130 to the address translation circuit, it is first 
inputted to the shared area detect circuit 141. 

FIG. 9 shows details of the shared area detect circuit 141. 
From an inputted command 141c, a part of an access address 
1410b is taken out. The access address 14106 is inputted to 
circuits 1410 and 1411. 

In the circuit 1410, it is detected by using the main 
memory configuration information in partition 165a whether 
the access address 14106 is included in any of main memo- 
ries within the local partition or not. In other words, it is 
checked for every bit having the valid bit 168 of FIG. 8 set 
equal to 1 whether the access address 14106 is located 
between the start address 166 and the end address 167. If the 
access address is included in any of the main memories, then 
"1" is outputted to output 1410a and the access address 
14106 is judged to be an address having a main memory in 
the partition. Since the internal circuit is the same as that of 
FIG. 6, detailed description thereof will be omitted. 

In the circuit 1411, it is detected by using the information 
150a of the address translation table whether the access 
address 14106 is included in any of partitions' shared areas 
defined in the address translation table or not. In other 
words, it is checked for every bit having the E bit of FIG. 5 
set equal to "1" whether the access address 14106 is 
included between the "address in partition" 152 and the 
"size" 154. If the access address is included in any of the 
partitions' shared areas, then "1" is outputted to output 
1411a, and the access address 14106 is judged to be one of 
shared areas exported from the inside of the partition or 
imported from the inside of the partition. 

If a gate 1412 outputs"!", i.e., if the access address 14106 
has a main memory in the partition, but it is not a shared area 
exported or imported, then the access command is distrib- 
uted to a not-shared side 1416 by a gate 1413. Otherwise 
(i.e., if the output of the gate 1412 is "0"), i.e., the access 
address 14106 does not have a main memory in the partition, 
or it is an exported or imported shared area, then the access 
command is distributed to a shared side 141a. 
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On the not-shared side 1416, the access command is sent 
to the destination indicate and global bit addition circuit 142 
without being subject to address translation. As a result, 
address translation is not conducted on the access command 

5 sent to the not-shared area. 

On the other hand, on the shared side 141a, the access 
command is inputted to the output side translator 144. FIG. 
6 shows details of the output side translator 144. From an 
inputted command 141a, an access address 1442a is sepa- 

10 rated. It is determined by a range detect circuit 1440 whether 
the access address 1442a is included in the "address in 
partition" 152 to the "size" 154 of a valid entry of the 
address translation table. The range detect circuit is provided 
at the rate of one per entry of the address translation table 

15 (1440 to 1440'). If the access address 1442a is included in 
the range of address in the partition of the address translation 
table, then a signal 1440a is outputted. If the signal 1440a 
is "1", then the "address in partition" 152 of a pertinent 
address translation table is outputted to an output A 1441a, 

20 and the "address in shared memory space" 153 is outputted 
to an output B 14416. In an address translation circuit 1442, 
a new address 14426 is calculated according the following 
equation by using an old base address 1441a outputted from 
the signal A (corresponding to the address in partition in the 

25 address translation table), a new base address 14416 out- 
putted from the signal B (corresponding to the shared 
memory area address in the address translation table), and 
the input address 1442a. 

3Q New Address=Input Addicss-Old Base Addrcss+New Base 

Address 

By virtue of this calculation, the input address issued from 
the inside of the partition (local real address of the partition) 
can be relocated to an address of the shared memory space. 

35 Also in the case where the access address has been included 
in another entry, similar processing is advanced via the range 
detect circuit 1440'. Together with a part of the command 
other than the address, the output address 14426 is sent to the 
destination indicate and global bit addition circuit 142 via a 

4Q gate 1445 (enabled by an OR 1443a of selection signals). If 
the input address 1442a does not match with any of entries 
of the address translation table, then an error is detected. 

The destination indicate and global bit addition circuit 
142 determines which node a command outputted from the 

45 inside of the node should be outputted to, and in addition 
determines the value of the global bit in the command. FIG. 
10 shows operation of the destination indicate and global bit 
addition circuit 142. The circuit conducts predetermined 
operation according to the kind of the command and whether 

50 the command is a command directed to a not^shared area (in 
the case where the command is inputted from the signal 
1416 side) or a command directed to a shared area (in the 
case where the command is inputted from the signal 144a 
side). 

55 (1) Snoop Request (F, FI, I) Command on not-shared 
Address 

After the node configuration in the partition is derived 
from the partition configuration 160, the command is mul- 
ticast to nodes in the partition. 
60 (2) Snoop Request (F, FI, I) Command on Shared Address 

The command is broadcasted to all nodes of the system. 

The global bit is set equal to "1". 
(3) Main Memory Access (WB, UW, UR) Command on 
not-shared Address 
65 On the basis of the main memory configuration informa- 
tion in partition 165, comparison is conducted to determine 
which node takes charge of the address range the access 
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address belongs to. The home node (in the local partition) of representing nodes to be authorized to share the pertinent 

the access address is derived, and thereafter the command is entry. If there is no problem as a result of the check, the 

sent to the home node. command is sent to the input side translator 145 through a 

The global bit is set to "0". signal 143a. In other words, translation from the shared area 

(4) Main Memory Access (WB, UW, UR) Command on s address to the real address in the partition is conducted. 
Shared Address ^ t * iere ^ a P rob ^ em as a result of the check in the circuit 

From the address translation table, the "home node num- 1434 > i tf * conducted from a node other than 

ber" 155 of the access address is derived. Tlie command is partiUon auth^d t0 ^ rc perUneDl Sh A ared "?* 

sent to the home node. ( exce P l the DND )> then 11 15 detected as an error " M a result ' 

The global bit is set to "1" io u n autnor ized access to a shared area in the partition can be 

(5) Acknowledge (D, DM, ND) Command on not-shared prevented. 

Address f e IS pertinent entry m the address translation 

ThTcommand is returned to the requesting node. (DND table > thc ac ^f add ' CSS ™^ C * ^ ^ 

to a not-shared area is not generated.) memorv shared bv the P a ' Utl0n ihis node b * lon f . to * ™* 

The lobal bit is set to "0" 15 casc occurs because in the destination indicate and global bit 

(6) Aci^owledge (D^DM, ND, DND) Command on Shared addition circuit 142 ^ bas transmitted the command, the 
Address snoop request command (F, FI, I) on the shared area is 

-ru a • h,«^ tu rAniloc ,; n f , nr«ri^> broadcasted uniformly to all nodes. Therefore, the command 

The command is returned to the requesting node. . ' . . • *u u • 

Th i h I h t * t t "1" 15 sent t0 n0 ^ es which are not included in the snaring 

?^n7.o the pToSsing heretofore described, the com- 20 nodes. In this case, a DND command indicating that there is 

mand outputted from the node can be transmitted to suitable no perunent data m the node must be returned to he access 

destination via the inter-node connection switch 900. node ln to com f * ™« acknowledge to the snoop 

(B) Processing Conducted when a Node has Received a C"* access node anticipates acknowledges from all 

i, ' , s nodes to which the command has been broadcasted.) After 

°IfT™ mmand inputted from the inter-node connection 25 being checked in a circuit 1435 that the access command is 

switch 900 is inputted to the address translation circuit 140, » »°°P request, therefore, the access command s conveyed 

then the command is first inputted to the shared area detect ° acknowledge cremt 146 through a sxgnal 143c. 

j j. . 1 ■ ♦ 1 a*i t Ao**^ r> n A (if the access command is not a snoop request, an error is 

and address check circuit 143. In the shared area detect and v iA r . 7 ' _ 

, , . a -a <■ •« reported.) In the DND acknowledge circuit 146, the DND 

address check circuit 143, classification of an inputted ^f^.u, ; & > 

command and error check are conducted. FIG. 11 shows 30 ^.s^tolb^^te^w^ 

details of the shared area detect and address check circuit * returned to ^ accessing node through the destination 

143. A command 143* sent from another node is first ^ . fT,^ will nnw he 

inputted to a circuit 1430. The value of the global bit is A QperaUoji of the input side translator l« w,U now be 

F . described. In the input side translator, the shared memory 

global bit is "0", i.e., in the case of access to a 35 address in the command * translated to a real address 

not-shareLea.thecommandisoutputtedto asignal 1430,, * ««» Partition on the : has* of information °f the address 

A . . t ! - t ■ ^JL^ „ n u ac : c n f th^ translation table 150. The input side translator has the same 

Thereafter, in circuit 1432, it is checked on the basis or the . r „i„„„. 

« *• • f ™t;™ iaii «^th^r th,. oonrre internal configuration as that of the output side translator, 

partition configuration information 160 whether the source , & . . r . , ^i,,,- 

node belongs To the local partition. Thereafter, in a circuit The ^P ut / ldc translator is differen only in the translation 

1433, it is checked on the basis of the main memory 40 Section from the outpu side translator. 

t . c ... 1/CC t L flll ,„ t t,l Finally, operation of the acknowledge wait circuit 147 

configuration information in partition 165 whether the \_ *T -u a tt, i t,i^!«„;t ^itidTJc 

& , - . • i j -j * ~f .„„ „nri, in will now be described. The acknowledge wait circuit 147 is 

access address is included in main memories of any node in . . * rwrvi 

, , i * j u* u u o „ A( . 0 j _ a circuit for totalizing acknowledges (D, DM, ND, DND) to 

the local partition. A command which has not posed a " VAl 6 6 K ' ■ ' , ' 

problem in both checks in the circuits 147 through a signal * and FI commands which request readou of dab from 

143b. In this case, address translation is not conducted. A 45 another node. THe acknowledge watt . ™ "J ^ ^ » » 

command which has posed a problem in the check of either commands. The acknowledg waU c »ouU 

^ , • r . -j f,u ^ „, 147 waits until acknowledges (D, DM, ND, or DND) come 

1432 or 1433, i.e., access from the outside of the partition or w^uum .... • V j V ^ a • u 

, ' . , p *u • „ t K« Irtiii.n ;o from all nodes which have issued F or FI commands, i.e., all 

access to the outside of the main memory in the partition is * , ' n \^ 

detected as an error. As a result, unauthorized access to a n° d f ° f the ^ >° «« °f shared areas, and all nodes 

not-shared area in the partition can be prevented. 50 * part.ton (mdicated by the partition configuration 

Processing conducted in the case where the global bit is ^forma hon 160) in the case of not-sha ed areas. Then the 

"1", i.e., in The case of access to a shared area will now be acknowledge wait circuit 147 conducts the followmg judg- 

described. The access command is sent to a circuit 1431 m f n ' s ' , . „ . , . . . 

through a signal 1430i. In the circuit 1431, it is checked W In case where the , D command has been returned: 

whether the access address is included in the range of the ss If the D command has been returned from one node, and 

"address in shared memory space" 153 to the "size" 154 of ND, DND, or DM has been returned from another 

any of entries of the address translation table 150 (i.e., node, then the latest data read out by the D command, 

whether thc access address is included in any of shared areas data modified in the cache of another node is 

exported or imported by the partition this node belongs to). returned. 

If there is the pertinent entry in the address translation 60 (2) In the case where the DM command has been returned: 

table, i.e., if the access address is an address shared by a If the DM command has been returned from one node, 

partition this node belongs to, the command is conveyed to and ND or DND has been returned from another node, 

a circuit 1434 through a signal 1431a. The DND is returned then data in the main memory read out by the DM 

from a not-shared node. On the other hand, in the case of a command is returned. 

command other than the DND, it is checked in the circuit 65 (3) In the case where the ND command has been returned: 

1434 whether the source node of the access command is If ND or DND has been returned from another node, then 

included in the bit map 156 of the address translation table the own node is the home node. A reply that there are 
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no data is returned. Thereafter, the main memory access then the F command is conveyed to the DND acknowledge 
circuit 130 reads out data in the main memory 180 and circuit 146, and a DND command is returned to the access- 
returns the data. ing node (global bit=l). 
The operation of the acknowledge wait circuit 147 here- [ n the accessing node which has received the acknowl- 
tofore described is the same as that of the cache coherent 5 e( jge (D, DM, ND, DND) for the F command, error check is 
check circuit of the conventional hierarchical SMP. conducted in the shared area detect and address check circuit 
Therefore, details of the internal configuration will be omit- ^ and thereafter the acknowledge is conveyed to the 

tec *- . _ acknowledge wait circuit 147 through the input side trans- 
How access to another node is conducted in the shared latof 145 In this ^ the access address ^ restored from 

memory multiprocessor of the present invention heretofore 1Q ^ address of ^ sharcd m to the rcal addrcss 

described will hereafter be described by taking the F com- Qf ^ ^ ^ acknowledge wait cjrcuit 

a)F ComSTnNot-shared Area 147 ;. rc P lies from ^ Qod ™ of the ^stem are waited, and the 

In a node which has issued the F command, the F re P h / s are returned to he accessing CPU Answering is 

command outputted from the inside of the node is sent from conducted by using the local real address of the accessing 

the shared area detect circuit 141 to the destination indicate 15 partition. 

and global bit addition circuit 142 through the signal 1416 As heretofore described, owing to the address translation 

(address translation is not conducted), and multicast to mechanism of the present invention, data can be accessed by 

nodes in the partition (global bit-0). using the local real address in the partition in both the 

In the node which has received the F command, error accessing node and the accessed node, 

check is conducted in the shared area detect and address 20 How the system software such as the OS manages the 

check circuit 143, and thereafter the F command is conveyed shared memory among partitions of the shared memory 

to the inside of the node through the signal 143£> and the multiprocessor in the present invention will now be 

acknowledge wait circuit 147 (address translation is not described. The OS itself may manage the shared memory 

conducted). among partitions, or system software for managing the 

The acknowledge (D, DM, ND) for the F command is sent 25 shared memory among partitions, such as middleware other 
from the shared area detect circuit 141 to the destination than the OS, may manage the shared memory among par- 
indicate and global bit addition circuit 142 through the titions. 

signal 1416 (address translation is not conducted), and In the present system, the partitions do not have a shared 

returned to the accessing node (global bit=0). memory at all when the system is initialized. Communica- 

In the accessing node which has received the acknowl- 30 tion between partitions is conducted by using inter-OS 

edge for the F command, error check is conducted in the communication means (170 and 910). Every inter-partition 

shared area detect and address check circuit 143, and there- communication for sharcd memory management hereafter 

after the acknowledge is conveyed to the acknowledge wait described is conducted by using this inter- OS communica- 

circuit 147 through the signal 1436 (address translation is tion means. 

not conducted). In the acknowledge wait circuit 147, replies 35 Hereafter, a node having a physical main memory of a 

from all nodes in the partition are waited, and are returned shared memory (a node which exports the shared memory is 

to the accessing CPU. referred to as home node. A partition the home node belongs 

(2) F Command on Shared Area to is referred to as home partition. Hereafter, with reference 

In the node which has issued the F command, the F to FIGS. 12 to 16, the procedure of the operation of the 

command outputted from the inside of the node is sent from 40 system software will be described in detail by dividing it into 

the shared area detect circuit 141 to the destination indicate allocation of a shared area in the main memory, addition of 

and global bit addition circuit 142 through the output side a shared partition, and erasing of a shared area, 

translator 144. In this process, the access address is trans- As for the management of the shared area, it is hereafter 

lated from the real address of the accessing partition to the assumed that the home partition basically conducts concen- 

address of the shared memory space. The F command is 45 trated management. Also in the case where some unit other 

broadcasted to all nodes of the system (global bit=l). than the home partition conducts the management, the 

If the node which has received the F command is included management can be realized by using the similar technique, 
in a partition sharing the accessed shared area, then the F (A) Allocation of the main memory of the shared area 
command is subjected to error check in the shared area Allocation of the main memory of the shared area is con- 
detect and address check circuit 143, and thereafter the F 50 ducted in the home partition (FIG. 12). A main memory of 
command is conveyed to the acknowledge wait circuit 147 a requested size is allocated to be used as a shared area, and 
through the input side translator In this process, the access necessary data are written into the address translation table, 
address is translated from the address of the shared memory The shared area can be dynamically allocated, 
space to the real address of the accessed partition. First, an unused entry (an entry having E-0) in the address 
Thereafter, the F command is conveyed to the inside of the 55 translation table of each node in the partition is allocated 
node. Snoop in the node is conducted by using the local real (step 2001). A real memory of the requested size is allocated 
address of the accessed partition. in the local main memory (step 2002). Thereafter, the system 

The acknowledge (D, DM, ND) for the F command software communicates with system software of other 

generated as a result of the above described process is sent partitions, and allocate an area of the requested size in the 

from the shared area detect circuit 141 to the destination 60 shared memory space (step 2003). In all partitions, the 

indicate and global bit addition circuit 142 through the shared memory must have the same address map. 

output side translator 144. In this process, the access address Subsequently, the following data concerning the shared 

is translated from the real address of the accessed partition area are written into the address translation table of each of 

to the address of the shared memory space again. The nodes in the partition (step 2004). 

command is returned to the accessing node (global bit— 1). 65 "Real address in partition" 152 (start address) 

If the node which has received the F command is not "Address in shared memory space" 153 (start address) "Size 

included in the partition sharing the accessed shared area, of shared area" 154 
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"Home node number" 155 (number of node which has 

secured a main memory) 
"Bit map of node to be authorized to share" 156 (initialize 

it to 0) 

At this time point, access authority to the shared area is s 
not given to any partition. Thereafter, "1" is written into E 
bit of the address translation table of each bode in the 
partition, and the entry is made valid (step 2005). 

By the processing heretofore described, an area in the 
main memory can be exported. It becomes possible to access 10 
the shared area allocated in the step 2002 from the address 
in the shared memory space allocated in the step 2003. 

It should be noted here that the same content must be 
written into address translation tables of all nodes in the 
partition. 15 
(B) Addition of Sharing Partition 

For making it possible for other partitions to access the 
shared area allocated in (A) (i.e., to import the shared area), 
the following procedure is necessary. The import processing 
of the shared area can be dynamically conducted at arbitrary 20 
time after the shared area is allocated in the home partition. 
(Bl) Partition which Newly Imports Shared Area 

In a partition of import side which is going to newly share 
a certain shared area, the following processing is necessary 
(FIG. 13). 25 

First, an unused entry (an entry having E-0) in the address 
translation table of each node in the partition is allocated 
(step 2101). A real space, to which a physical main memory 
is not mapped, of a requested size is allocated (step 2102). 
This real space is used as a window for accessing a shared 30 
area having a physical main memory in another partition. 
Thereafter, access permission of a pertinent shared area is 
requested from the system software of the home partition 
(step 2103). Here, in a partition which newly imports the 
shared area, the home node conducts processing of steps 35 
2201 to 2206 (which will be described later) and waits return 
of a reply of access authorization. 

Thereafter, if access authorization comes from the home 
partition, an entry concerning the shared area of the address 
translation table sent together with access authorization is 40 
received (step 2104). Contents of the entry of the address 
translation table received in the step 2104 (except the E bit 
and address in the partition) are written into the address 
translation table of each of nodes in the partition, and the real 
address allocated in step 2102 is written into the "address in 45 
partition" field 152 of the address translation table of each of 
nodes in the partition (step 2105). Finally, "1" is written into 
the E bit of the address translation table of each of nodes in 
the partition. 

By the processing heretofore described, import of 10 the 50 
shared area is completed, and it is possible to access the 
shared area in the home node from the real address allocated 
in the step 2102. 
(B2) Home Partition 

In the system software of the partition requested 15 to 55 
permit access to the shared area in the step 2103, processing 
described hereafter becomes necessary (FIG. 14). First the 
access request is checked, and it is decided whether the 
requesting partition is permitted to share the pertinent shared 
area (step 2201). Hereafter, a procedure in the case where the 60 
access is permitted will be described. 

Subsequently, an old value of the "nodes to be authorized 
to share" field 156 of an entry corresponding to the shared 
area of the address translation table is stored into a work 
variable W (step 2202). In the "nodes to be authorized to 65 
share" field 156 of an entry corresponding to the shared area 
of the address translation table possessed by each of nodes 
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in the partition, bits corresponding to all nodes which belong 
to the requesting partition are set (step 2203). As a result, the 
requesting partition is authorized to share the pertinent 
shared area. 

Furthermore, the new value of the "nodes to be authorized 
to share" field 156 is sent to each of partitions (except the 
own partition) which includes nodes having W set equal to 
1, and each of the partitions is requested to modify the field 
(step 2204). It should be here noted that the system software 
which manages shared areas is one in number every partition 
and each partition needs only to be requested once. Arrival 
of ACK telling that modification of the "nodes to be autho- 
rized to share" field 156 has been completed, from every 
partition requested in the step 2204 is waited (step 2205). 

Finally, the entry of the shared area of the address 
translation table is sent to the requesting partition together 
with the access authorization of the shared area (step 2206). 

By virtue of the processing heretofore described, the 
"nodes to be authorized to share" field 156 in the address 
translation table of every node which shares the pertinent 
shared area is updated, and the requesting partition can share 
the area. 

(B3) Partitions which have Shared the Pertinent Area Until 
then (Except Home Partition) 

In the partitions which have shared the pertinent area until 
then (except the home partition), the request issued by the 
home partition in the step 2204 is received, and the "nodes 
to be authorized to share" field 156 of an entry correspond- 
ing to the shared area of the address translation table 
possessed by each of nodes in the partition is updated. Upon 
completion of updating, the ACK is returned to the home 
partition. 

(C) Erasing of Shared Area 

Erasing of a shared area is started by the home partition, 
and all partitions which have shared the shared area execute 
a procedure hereafter described, in cooperation. Erasing of 
a shared area can also be conducted dynamically. After the 
shared area has been erased once, resources, such as the 
resource address translation table, the main memory, and the 
shared address space, which have been used until then can 
be utilized again. 

It is assumed in the processing hereafter described that the 
application program has finished the use of the pertinent 
shared area. Therefore, access to the shared area to be 
opened is not caused. 
(CI) Processing in Home Partition 

The home partition makes all other partitions sharing the 
pertinent area erase the shared area, and thereafter erases the 
shared area of its own partition. 

First, the home partition requests every partition which 
includes nodes corresponding to 1 in the "nodes to be 
authorized to share" field 156 of the address translation table 
(except its own partition) to erase the shared area (step 
2301). And the home partition waits for every requested 
partition to return the ACK (step 2302). As a result, it is 
assured that the shared area has been erased in the partitions 
which imported the shared area. 

Subsequently, the home partition purges all data of the 
shared area to be erased, in caches of all processors in its 
own partition (step 2303). Old data in the cache are thus 
driven out. Thereafter, the home partition invalidates the 
entry (i.e., writes "0" in the E bit) corresponding to the 
shared area to be erased, in the address translation table of 
every node in the partition (step 2304). The home partition 
releases the area in the shared memory space and the 
physical memory of the shared area (step 2305). Finally, the 
home partition purges the TLB of every processor in the 
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partition (step 2306). By virtue of the processing heretofore 
described, resources used to access the shared area can be 
completely released. 

(C2) Processing in Partitions which Import Shared Area 

In each partition requested to erase the shared area by the 
home partition in the step 2301, processing heretofore 
described becomes necessary. 

First, in all processors in its own partition, data of the 
shared area to be erased, in caches of all processors in its 
own partition are purged (step 2401). Thereafter, the perti- 
nent entry in the address translation table of every node in 
the partition is invalidated (step 2402). The shared memory 
space, and the area (window area) allocated in the real 
address space in the partition to access the shared memory 
are released (step 2403). The TLB of every processor in the 
partition is purged (step 2404), 

As a result, resources used to access the shared area can 
be completely released. Finally, the ACK is returned to the 
home partition (step 2405) to inform the home partition that 
the shared area has been completely released in its own 
partition. 

By virtue of the procedure heretofore described, it is 
possible in the shared memory multiprocessor of the present 
invention to dynamically manage the memory shared among 
partitions. 

By virtue of the configuration heretofore described, it is 
possible in the hierarchical bus connection SMP to realize a 
memory shared by partitions. 

The function of resetting the partitions will now be 
described in detail. FIG. 17 shows a reset circuit according 
to the present invention. FIG. 17 shows only node 0. Other 
nodes also have absolutely the same configuration. 
Hereafter, the configuration of the node 0 will be described 
in detail. The CPU, the main memory access circuit, and the 
address translation circuit in each node are driven by a reset 
signal 921. Respective nodes have independent reset signals 
921 to 928. Respective reset signals are driven by a reset 
transfer circuit 920. In addition, each node has a reset 
register 175 which can be accessed from CPUS, and each 
node can request resetting of other nodes. 

First, if an external reset signal (reset at the time of power 
on, a signal from an ordinary reset button) 929 is driven, 
then reset signals (921 to 928) of all nodes are made active 
and all nodes are reset as usual. 

The reset register of each node includes a node bit map 
1751 indicating nodes to be reset, and an enable bit 1750. If 
the enable bit 1750 is set to "V, the reset signal is sent to 
nodes corresponding to "1" in the bit map 1751 through the 
circuit 920. 

By virtue of the circuit heretofore described, it becomes 
possible for software of each node to reset arbitrary nodes. 

In the case where a certain partition (hereafter referred to 
as partition A) suffers from system down, the following 
operations are conducted by software of one (hereafter 
referred to as partition B) of partitions which does not suffer 
from system down. 

(1) The partition B senses that the partition A has suffered 
from system down. (It can be known by a heart beat 
function or the like via the shared memory.) 

(2) The partition B conducts investigation of error cause and 
removal of an error cause via the memory shared by 
partitions, and prepare information so as to allow the 
partition A to resume its task. 

(Since arbitrary addresses including the system area can be 
shared in the shared memory function of the present 
invention, error analysis including the OS is possible.) 

(3) The partition B sets bits of the bit map 1751 correspond- 
ing to nodes of the partition A to "1" (and sets other bits 
to "0".) 
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(4) The partition B sets the enable bit 1750 to "1" and then 
restores it to "0". (As a result, a reset signal is conveyed 
to respective nodes of the partition A through the reset 
transfer circuit 920.) 

(5) Owing to the processing heretofore described, the par- 
tition A can resume the execution. (By using the infor- 
mation of (2), application is resumed.) 

By virtue of the reset function heretofore described, it 
becomes possible for partitions to back up each other, and 
high reliability can be realized. 

The present invention is not limited to the embodiment 
heretofore described, but can be applied to various varia- 
tions. 

(1) In the foregoing description, the snoop command on the 
shared area is broadcasted to all nodes in the system. 

In contrast thereto, there is also possible such a scheme as 
to broadcast only to nodes in partitions sharing the accessed 
shared area in order to reduce the traffic imposed on the 
switch. 

The following points are changed. 

When the destination of a command outputted from the 
inside of a node is decided in the destination indicate 
and global bit addition circuit 142 (see FIG. 5), the 
snoop (F, FI, I) command on a shared area is multicast 
to only nodes corresponding to "1" in the "partitions to 
be authorized to share" field 156 of the address trans- 
lation table corresponding to the shared area. (As a 
result, the snoop command is not sent to partitions 
which do not share the area.) 

The DND command and the DND acknowledge circuit 
are not used. 

When acknowledges of the command on the shared area 
are waited in the acknowledge wait circuit 147, as 
many acknowledges as nodes corresponding to "l"s in 
the "partitions to be authorized to share" field 156 of 
the address translation table corresponding to the 
shared area are waited. (As a result, acknowledges of 
the command in the sharing partitions are waited.) 

(2) In the foregoing description, nodes to be authorized to 
share are stored in the bit map 156 corresponding to each 
node. However, it is also possible to store nodes to be 
authorized to share in a bit map corresponding to each 
partition. 

(3) In the foregoing description, the main memory access 
circuit 130 and the main memory 180 in the node are 
inserted between the address translation circuit 140 and 
the CPU bus 190. However, there is also possible such a 
scheme that the address translation circuit 140 is directly 
coupled to the CPU bus 190 (but is not directly coupled 
to the main memory access circuit 130). Apart from the 
address translation circuit 140, the main memory access 
circuit 130 is connected to the CPU bus 190. In this case, 
access from another node to the main memory 180 is 
conducted via the CPU bus 190. In this case as well, the 
present invention having the address translation means 
140 at the entrance of each node can be utilized as it is. 

(4) In the foregoing description, the components 141 to 145 
and 147 are shown as separate circuits. By making the 
circuits of the output side (141,142 and 144) and the 
circuits of the input side (143, 145 and 147) common 
respectively, however, duplicated circuits such as the 
search of the address translation table can be reduced. 

(5) In the foregoing description, addresses of the shared 
memory space can be taken in arbitrary positions. In order 
to realize arbitrary translation, address translation is real- 
ize in a full associative table. 

In contrast thereto, by representing the node number by 
high-order bits of address of the shared memory space and 
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representing the real address in the partition by lower-order 
bits, the address translation on the export side in the home 
partition can be significantly reduced. In this case, a real 
address in the partition can be translated to an address of the 
shared memory space by simply adding a node number as 
high -order bits. An address of the shared memory space can 
be translated to a real address in the partition by simply 
removing the node number in the high- order bits. 

In this case, in order to store, in each node, the manage- 
ment information of shared areas with respect to the whole 
main memory space in the partition, there is needed a table 
for storing, every managed unit (having, for example, 1 MB) 
of shared area, a bit indicating whether sharing is possible 
(i.e., a bit for storing information as to whether the 1-MB 
space is a shared area or a not-shared area) and a bit map of 
nodes to be authorized to share (which is the same infor- 
mation as the field 156 of the address translation table). 

As a result, the address translation and access check on 
the export side (home partition) can be significantly simpli- 
fied and the hardware can be reduced. 

(6) In the foregoing description, the address translation table 
of the export side is common to the address translation 
table of the import side. On the import side, however, a 
table different from that of the export side may be used. 
In this case, the "home node number" 155 on the export 
side (which is redundant information) can be eliminated. 

(7) In the foregoing description, it is possible to freely read 
and write data with respect to the snared area. By making 
the shared area read only from the outside of a partition, 
however, the hardware can be simplified. In this case, 
check of the address and the node number (check in the 
circuits 1432 to 1434) in the shared area detect and 
address check circuit 143 is unnecessary. Furthermore, the 
"nodes to be authorized to share*' field 156 needs only to 
be stored in the home node alone. If a write command 
such as FI, I, WB or UW comes from the outside of the 
partition, the shared area detect and address check circuit 
143 reports an error. 

Furthermore, by combining (5) to (7), the hardware 
quantity can be significantly reduced. 

(8) By adding a read only bit to the address translation table, 
it is also possible to set only a specific shared area to read 
only from the outside of the partition. If in that case a 
write command such as FI, I, WB or UW comes from the 
outside of the partition to a shared area having a set read 
only bit, the shared area detect and address check circuit 
143 reports an error. 

(9) In the foregoing description, the destination indicate and 
global bit addition circuit 142 located at the exit of each 
node specifies the destination of the snoop (F, FT, I) 
command issued from the node. In contrast thereto, there 
is possible to adopt such a scheme that the inter-node 
connection switch 900 has information equivalent to the 
partition configuration information 160 and the destina- 
tion is specified in the inter-node connection switch 900, 
In that case, if the global bit in the command is "1" (i.e., 
in the case of access to a shared area), the command is 
broadcasted to all nodes in the system is conducted. If the 
global bit in the command is "0" (i.e., in the case of access 
to a not-shared area), the command is multicast only to 
nodes in the same partition as the destination node. 

(10) In the foregoing description, CPUs in the node are 
connected by the bus 190. However, other connection 
forms (such as connection using a switch, or one-to-one 
connection to the main memory access circuit 130) may 
also be used. 

(11) In the above described embodiment, the main memory 
configuration information in partition 165 separately 
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stores the ranges of the main memory respective nodes 
take charge of, by using pairs of the start address 166 and 
the end address 167. However, by using, for example, 
such a technique as to make the end address of the node 

5 n common to the start address of the node n+1, the 
hardware can be reduced. Furthermore, each node may 
take charge of a plurality of real address areas by provid- 
ing a plurality of pairs of the start address 166 and the end 
address 167 as the areas each node takes charge of. 

10 (12) In the foregoing description, CPUS (110 to 113) in the 
node have independent caches. However, an external 
cache (level-3 cache) shared by a plurality of CPUs may 
be provided. It is also possible to provide the main 
memory access circuit 130 of each node with a copy of the 

]5 cache TAG of the CPU, and filter the cache coherent 
transaction coming from another node. 

(13) In the foregoing description, the inter- node connection 
switch 900 is connected by a crossbar network. However, 
the inter-node connection switch 900 may also be con- 

20 nected by using a network of a different form (such as 
complete coupling or a multi-stage network). 

(14) In the above described embodiment, broadcasting or 
multicasting is ordered by specifying the destination node 
in the bit map when issuing a network command to other 

25 nodes. However, broadcasting or multicasting may be 
realized by issuing a plurality of commands for each of 
the destination nodes from the destination indicate and 
global bit addition circuit 142. 

According to the present invention, the inside of the 
30 hierarchical bus connection SMP is divided into a plurality 
of partitions. There is provided, at the gateway of each node, 
means for bidirectionally translating an address of a shared 
area between an address in a partition and a shared memory 
space address common to partitions, when realizing a shared 
35 memory between partitions. As a result, each partition has a 
free address space, and data in a shared area can be managed 
at high speed by a snoop protocol. In addition, an inter- 
partition shared memory mechanism making possible fault 
containment between partitions can be realized. 
40 Furthermore, by providing means for dynamically generat- 
ing and erasing the above described address translation 
information, the shared memory between partitions can be 
managed flexibly. 
What is claimed is: 
45 1. A shared memory multiprocessor system having a 
plurality of nodes and a network for interconnecting nodes, 
each of the plurality of nodes including at least one CPU, at 
least one cache, and at least one main memory, cache 
coherent control being conducted between nodes sharing a 
50 main memory via said network, said plurality of nodes being 
divided into a plurality of partitions, each of said partitions 
including at least one node, the partitions respectively 
locally sharing a main memory, 

wherein a memory address of said system comprises a 
55 local real address and an address of common memory 
space, said local real address being local to each of said 
plurality of partitions, said address of common memory 
space being used in common in a memory space used 
in common between said plurality of partitions, 
60 wherein a main memory of each node is accessed by using 
a local real address of a partition the node belongs to, 
and 

wherein as for access from each CPU to a main memory 
of another node, data of a main memory of a node in its 
65 partition is accessed via said network by coherently 
using a local real address of the partition, and access to 
data of a main memory of a node included in another 
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partition is conducted by translating an access address 
to an address of said shared memory space when an 
access command is issued to said network and trans- 
lating an address of said shared memory space to a local 
real address of said another partition when said access 
command enters said node included in said another 
partition. 

2. A shared memory multiprocessor system having a 
plurality of nodes and a network for interconnecting nodes, 
each of the plurality of nodes including at least one CPU, at 
least one cache, and at least one main memory, cache 
coherent control being conducted between nodes sharing a 
main memory via said network, said plurality of nodes being 
divided into a plurality of partitions each including at least 
one node locally sharing a main memory, a local real address 
local to each of said plurality of partitions is used as a 
memory address of the partition, 

wherein said shared memory multiprocessor includes an 
address translation means provided at a gateway of 
each of said plurality of nodes to said network, said 
address translation means being used when a shared 
area shared by at least two of said partitions is set in any 
of said main memories, an address specified by a 
command for accessing said shared area is subjected in 
said address translation means to mutual translation 
between said local real address used in a partition and 
an address of the shared are used in said network, and 

wherein a main memory of each node is accessed by using 
said local real address no matter whether it is said 
shared area. 

3. A shared memory multiprocessor according to claim 2, 
wherein a command for accessing an area which is not 
shared between said partitions is exchanged between each 
node and the network for interconnecting nodes without 
conducting address translation. 

4. A shared memory multiprocessor according to claim 2, 
wherein when sending out a snoop command from each of 
said plurality of nodes to the inter-node connection network, 
a command for accessing an area which is not shared 
between partitions is multicast to only nodes in the partition, 
and a command for accessing an area shared by partitions is 
broadcasted to all nodes belonging to at least partitions 
sharing the area, 

5. A shared memory multiprocessor according to claim 2, 
wherein in said plurality of partitions, every node included 
in a partition having said shared area set in a main memory 
included in its partition has mapping means for mutually 
mapping between an address in a shared memory space and 
a local real address of the partition of said shared area, 

6. A shared memory multiprocessor according to claim 5, 
wherein said mapping means conducts mapping in accor- 
dance with a table having mapping information set for each 
shared area, and as a result, a plurality of shared areas can 
be set as a whole. 

7. A shared memory multiprocessor according to claim 2, 
wherein in said plurality of partitions, on a local real address 
space of a partition sharing a shared area set in a main 
memory of another partition, a window area for accessing a 
shared area set in the main memory of said another partition 
is provided, said window area does not have a main memory 
in its own partition, and every node in the partition having 
said window area has means for mutually mapping between 
a local real address of the own partition of said window area 
and said shared memory space address of said shared area. 

8. A shared memory multiprocessor according to claim 7, 
wherein it is made possible for two or more partitions to 
share one shared area by permitting mapping between 



0,496 Bl 

24 

window areas respectively set in a plurality of partitions and 
a shared memory space address of one shared area. 

9. A shared memory multiprocessor system having a 
plurality of nodes and a network for interconnecting nodes, 

5 each of the plurality of nodes including at least one CPU, at 
least one cache, and at least one main memory, cache 
coherent control being conducted between nodes sharing a 
main memory via said network, 
wherein each of said plurality of nodes comprises infor- 
50 mation addition means, said information addition 
means being used when said plurality of nodes are 
divided into a plurality of partitions each including at 
least one node locally sharing a main memory and a 
shared area shared by two or more of said partitions, 
1 5 said shared area being provided in any of said main 
memories, and said information addition means adds 
information indicating whether a command sent out 
from a node to said network is a command for access- 
ing said shared area or a command for accessing a local 
20 area in said partition which includes said node having 
said information addition means, to the command, 
wherein said local area is addressed by a local address 
local to said partition and said shared area is addressed 
by a shared address of said shared area, and 

25 

wherein when said command is a command for accessing 
said shared area a shared address associated with said 
command is subjected to address translation to a local 
address to address a main memory in which said shared 
30 area is provided. 

10. A shared memory multiprocessor system having a 
plurality of nodes and a network for interconnecting nodes, 
each of the plurality of nodes including at least one CPU, at 
least one cache, and at least one main memory, cache 

35 coherent control being conducted between nodes sharing a 
main memory via said network, 
wherein each of said plurality of nodes comprises: 
information addition means used when said plurality of 
nodes are divided into a plurality of partitions, each 

40 including at least one node locally sharing a main 

memory, and a shared memory area shared by two or 
more of said partitions is provided in any of said 
main memories, said information addition means 
adding information, indicating whether a command 

45 sent out from that node to said network is a command 

for accessing said shared area or a command for 
accessing a local area in said partitions, to the 
command, and when said command is a command 
for accessing said shared area, said information 

5q addition means translates an address specified by 

said command from a real address local to a partition 
including its own node to an address in the shared 
area and sends out a resultant address into said 
network, and 

55 means responsive to added information of a command 
received from said network indicating that the com- 
mand is a command accessing said shared area, for 
translating an address of the shared area specified by 
said command to a real address local to a partition 
60 including its own node, 

wherein a main memory of each node is accessed by using 
the real address no matter whether that main memory is 
said shared area. 

11. A shared memory multiprocessor system having a 
65 plurality of nodes and a network for interconnecting nodes, 

each of the plurality of nodes including at least one CPU, at 
least one cache, and at least one main memory, cache 



09/22/2003, EAST version: 1.04.0000 



US 6,510,496 Bl 



25 



26 



coherent control being conducted between nodes sharing a 
main memory via said network, 

wherein each of said plurality of nodes comprises: 
information addition means used when said plurality of 
nodes are divided into a plurality of partitions, each 
including at least one node locally sharing a main 
memory, and a shared area shared by two or more of 
said partitions is provided in any of said main 
memories, said information addition means adding 
information, indicating whether a command sent out 
from that node to said network is a command for 
accessing said shared area or a command for access- 
ing a local area in said partitions, to the command, 
storage means for setting nodes included in a partition 

a node belongs to, and 
means responsive to a command received from said 
network being a command for accessing said local 
area, for checking whether the accessing node of the 
command is a node included in a partition that node 
belongs to, and responsive to a negative result, for 
suppressing access of said command, 
wherein said local area is addressed by a local address 
local to said partition and said shared area is addressed 
by a shared address of said shared area, and 
wherein when said command is a command for accessing 
said shared area a shared address associated with said 
command is subjected to address translation to a local 
address to address a main memory in which said shared 
area is provided. 

12. A shared memory multiprocessor system having a 
plurality of nodes and a network for interconnecting nodes, 
each of the plurality of nodes including at least one CPU, at 
least one cache, and at least one main memory, cache 
coherent control being conducted between nodes sharing a 
main memory via said network, and when said plurality of 
nodes are divided into a plurality of partitions each including 
at least one node locally sharing a main memory, a command 
for locally accessing a main memory in each of said parti- 
tions and a command for accessing a shared area of a main 
memory shared by a plurality of partitions are used, 

wherein each of said plurality of nodes comprises: 
storage means responsive to determination of partitions 
sharing said shared area, for setting nodes included 
in said partitions, and 
means responsive to a command received from said 
network being a command for accessing the local 
area, for checking whether the accessing node of said 
command is included in the nodes set in said storage 
means, and responsive to a negative result, for sup- 
pressing access of said command, 
wherein said local area is addressed by a local address 
local to said partition and said shared area is addressed 
by a shared address of said shared area, and 
wherein when said command is a command for accessing 
said shared area a shared address associated with said 
command is subjected to address translation to a local 
address to address a main memory in which said shared 
area is provided. 

13. A shared memory multiprocessor system according to 
claim 12, wherein a plurality of said shared areas are set 
independently, and nodes of a set of partitions authorized to 
share each of said plurality of shared areas are independently 
set in said storage means. 

14. In a shared memory multiprocessor system having a 
plurality of nodes and a network for interconnecting nodes, 
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each of the plurality of nodes including at least one CPU, at 
least one cache, and at least one main memory, cache 
coherent control being conducted between nodes sharing a 
main memory via said network, and when said plurality of 
nodes are divided into a plurality of partitions each including 
at least one node locally sharing a main memory, a local real 
address local to each of said plurality of partitions is used as 
a memory address of the partition, 

a shared area generation method of a shared memory 
multiprocessor comprising the steps of: 
when generating a shared area shared by two or more 
partitions of said partitions, allocating an area used 
as a shared area to a main memory of a node in one 
partition included in partitions attempting to share 
the shared area, said one partition being own parti- 
tion; and 

defining address mapping to the allocated area and a 
shared area, 

wherein a local area is addressed by a local real address 
local to said partition and said shared area is addressed 
by a shared address of said shared area, and 

wherein when accessing said shared area a shared address 
associated with said accessing is subjected to address 
translation to a local real address to address a main 
memory in which said shared area is provided. 

15. A shared area generation method according to claim 
14, wherein the allocation of an area used as said shared area 
and the definition of said mapping are carried out by system 
software prepared in said one partition. 

16. In a shared memory multiprocessor system having a 
plurality of nodes and a network for interconnecting nodes, 
each of the plurality of nodes including at least one CPU, at 
least one cache, and at least one main memory, cache 
coherent control being conducted between nodes sharing a 
main memory via said network, and when said plurality of 
nodes are divided into a plurality of partitions each including 
at least one node locally sharing a main memory, a local real 
address local to each of said plurality of partitions is used as 
a memory address of the partition, 

a shared area generation method of a shared memory 
multiprocessor comprising the steps of: 
when generating a shared area shared by two or more 
partitions in any of said partitions, allocating a 
window area in a partition in a local real address 
space in the partition, to partitions attempting to 
share the shared area, other than a partition having an 
area used as the shared area allocated in a main 
memory; and 

defining address mapping of the allocated area and a 
shared area, 

wherein a local area is addressed by a local real address 
local to said partition and said shared area is addressed 
by a shared address of said shared area, and 

wherein when accessing said shared area a shared address 
associated with said accessing is subjected to address 
translation to a local real address to address a main 
memory in which said shared area is provided. 

17. A shared area generation method according to claim 
16, wherein the allocation of the window area and the 
definition of mapping are carried out respectively by system 
software pieces respectively prepared in partitions other than 
a partition having an area used as a shared area and allocated 
in a main memory. 
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